Abstract

This paper studies how the effectiveness of teachers varies by classroom composition. We combine random 
assignment of teachers to classrooms with rich measures of teacher effectiveness based on a popular 
observational protocol, Framework for Teaching, to overcome key endogeneity concerns related to measurement 
and matching. We find that complementarities between classroom composition and teaching practice play a 
significant role in student achievement. We identify two main mechanisms that are driving this result: 1) negative 
interactions between challenging and/or student-centered practices and heterogeneity in classroom prior 
achievement, and 2) positive interactions between classroom management skills and average classroom prior 
achievement. Our findings illustrate the multidimensional nature of teacher effectiveness and have important 
implications for prescribing teaching practice and evaluating teachers. Simulations show that teacher rankings 
change substantially simply from within-school classroom reallocations, suggesting the need for caution when 
using popular teaching evaluation rubrics in high-stakes settings. 
1 Introduction


Teachers are largely understood to be the most important school-level determinant of achievement. 
Dating at least back to the Coleman Report of 1966, school peers have also been viewed as a key 
factor. While a vast literature has contributed to significant improvements in our understanding of 
the roles of teachers and peers (Gamoran et al., 2000; Rivkin et al., 2005; Sacerdote, 2011; Epple 
and Romano, 2010), a significant limitation of this literature is the treatment of teachers and peers 
in isolation. We address this gap by studying complementarities between teachers and classroom 
composition. The existence of complementarities has important implications for key teacher-related 
policies, including the importance of taking classroom composition into account when measuring 
teacher effectiveness and prescribing teaching practice. It further pushes the envelope on peer- 
effect-related policies beyond the often stark tradeoffs of regrouping students to the question of 
better matching teachers or teaching practice to different classroom compositions. 

Why might complementarities between teachers or teaching practice and classroom composition 
be an important part of achievement production? One example is that the benefits of encouraging 
classroom discussion may vary depending on the heterogeneity in initial achievement of a student’s 
classmates. Furthermore, teachers can play a central role in determining the nature of classroom 
peer interactions. For instance, peer effects could be amplified by teaching practices that create a 
positive learning environment and promote a learning dialogue among students.¹

Two important barriers have hindered the exploration of complementarities between teachers 
and classroom composition. First, detailed longitudinal data on teaching practices on a large scale 
are relatively rare. Second, endogeneity concerns related to nonrandom allocation of teachers to
classrooms have posed significant challenges to identification. We overcome these challenges by
exploiting a unique data set—the Measures of Effective Teaching (MET) Longitudinal Database. 
The key features of the data are rich information on teaching practices in a context where teachers 
are randomly assigned to classrooms. Teachers are evaluated by trained raters using a research-based
protocol that is increasingly used to measure teaching effectiveness in schools nationwide, the
Framework for Teaching Evaluation Instrument (FFT) (Danielson, 2011).² For classroom composition,
we focus on classroom peer initial achievement, the most-studied type of peer spillover in the
literature (Sacerdote, 2011).

¹ Teaching practices not only involve the principles and methods used for instruction (e.g., class discussions vs.
recitation), but also those actions that affect the social dynamics of a given classroom (e.g., classroom management).

The random assignment of teachers eliminates one of the most important confounding factors 
for measuring teacher effectiveness, the systematic matching of students to teachers that would lead 
us to confound teacher or peer effects with unobservable teacher or peer quality. However, even
with random assignment, our identification strategy needs to address a number of remaining endo- 
geneity concerns. The first is that there is considerable non-compliance in the data. We address 
this by relying on the variation generated by the randomly-assigned teacher rather than the actual 
teacher. Second, classroom composition may be endogenous. We apply a result in Bun and Harri- 
son (2018) and Nizalova and Murtazashvili (2014) to show that the random assignment of teachers 
to classrooms is enough to obtain consistent estimates of the complementarities between teaching 
practice and classroom composition as long as students do not re-sort to classrooms in response to 
teachers, which we can test using initially-assigned classrooms.³ Third, if teachers choose practices
to maximize student achievement, the observed teaching practice could be endogenous to the class- 
room composition. We address this primarily by focusing on prior year teaching practices, thus 
capturing teachers’ proclivity toward certain practices. Fourth, teaching practice is measured with 
error. We exploit multiple measures of teaching practice and use factor models to identify what 
aspects are separable in the data. We rely primarily on averages of multiple measures of teaching 
practices to address measurement error, but show robustness to a number of other approaches.⁴
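
The second identification point can be illustrated with a small simulation. The sketch below is not the paper's estimator: the data-generating process, the coefficient values (0.5, 0.3, 0.2) and the function names are all hypothetical. It only shows that when the practice measure is independent of classroom composition and of unobservables, as random assignment is meant to ensure, OLS recovers the interaction coefficient even though the coefficient on composition itself is biased.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def simulate(n=20000, seed=0):
    """Hypothetical DGP: composition is endogenous, practice is randomized."""
    rng = random.Random(seed)
    rows, y = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)            # unobserved classroom shock
        m = 0.8 * u + rng.gauss(0, 1)  # composition, correlated with u
        p = rng.gauss(0, 1)            # practice of the randomly assigned teacher
        rows.append([1.0, m, p, p * m])
        y.append(0.5 * m + 0.3 * p + 0.2 * p * m + u + rng.gauss(0, 1))
    return ols(rows, y)


intercept, b_m, b_p, b_pm = simulate()
print(f"composition: {b_m:.3f} (true 0.5); interaction: {b_pm:.3f} (true 0.2)")
```

In this design the composition coefficient absorbs the unobserved shock, while the practice and interaction coefficients stay close to their true values because the practice measure is independent of everything else.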

We ground our empirical strategy in a simple theoretical model of student behavior, which helps 
inform the structure of the estimating equations and illustrates the potential pervasiveness of the 
complementarities in teaching practice and classroom composition. We show that even when the
learning production function does not directly depend on the interaction between teaching practice
and peer initial achievement, a complementarity between teachers and peers could emerge indirectly
through students’ endogenous responses to teaching practices. One such example is when teachers
with better classroom management practices make engagement less costly, and students benefit
more from their peers if they are engaged with the material.

² Kane et al. (2011) show the importance of this teacher evaluation protocol in an observational context.

³ Balancing tests support that classroom composition is indeed random within randomization blocks and that
similar results hold when we use initially assigned classrooms, before students had an opportunity to re-sort based
on teachers.

⁴ These include instrumenting contemporaneous teaching practices with prior practice and adapting the estimation
approach developed by Hausman et al. (1991) for nonlinear error in variables models to apply to our setting, a panel
model where the nonlinearity takes the form of complementarities.

Our main finding indicates that complementarities between classroom composition and teaching 
practice play a key role in student achievement. More specifically, we identify two main mechanisms 
that are driving this result. Certain subdomains of FFT, which we label as classroom management
practices, interact positively with average peer initial achievement, while other subdomains, which
we label as challenge/student-centered practices, interact negatively with the interquartile range
of prior achievement. Moreover, FFT does not emerge as an important predictor of achievement
until classroom composition is taken into account. Finally, we use simulations to illustrate how
much complementarities matter for measured teacher contributions to learning. We show that
reallocations of teachers across classrooms within schools have significant effects on measures of
teachers’ contributions to learning. This points to the need for caution, particularly in
implementing high-stakes policies such as those that aim to replace the worst 5% to 10% of teachers
with average teachers (Chetty et al., 2014; Hanushek, 2011).⁵

The identification of the mechanisms behind our main finding is not simple given that measured 
teaching practice may be correlated with some unobservable aspects of the teacher. This is an issue 
that all the literature that seeks to evaluate characteristics of effective teaching shares.⁶ To address
this concern, we explore whether our findings are driven instead by some other correlated teacher 
attribute. First, we show that after combining our teaching practices in a single specification, our
key results become stronger than in specifications that include them separately, suggesting that
omitted variable bias, if any, is attenuating our main results. Second, we show that our results are
robust to controlling for an unusually rich set of teacher quality measures, including principal and
student surveys along with a teaching knowledge assessment.⁷

⁵ Section 7 describes how we define teacher contribution to learning.

⁶ For instance, see Araujo et al. (2016) and Taylor (2018) for discussions of this challenge.

⁷ Taylor (2018) also shows that different types of instructional methods play an important role in student
achievement beyond just teaching skills. Although educational researchers make an important distinction between teacher
quality and teaching quality (Hamilton, 2012; Kennedy, 2010), we use the term “teacher” here, assuming the teacher
knowledge measures reflect relatively stable traits.


We make several important contributions to the literature. First, we demonstrate how failing to 
capture the heterogeneity in the effectiveness of teaching practice by classroom composition leads 
us to understate the importance of measures of teacher effectiveness and even, in some cases, to 
infer that the practice does not matter when in fact the effects are sizable in certain classrooms. 
This provides insight into why observable teacher measures may often do a poor job of capturing 
teacher quality (e.g. Rivkin et al., 2005). From a policy perspective then, understanding this type 
of heterogeneity is crucial for identifying what teaching practices matter and in what classroom 
contexts. 

Second, our research connects closely to a number of recent studies that consider heterogeneity 
in teacher effectiveness by student background characteristics (Lavy, 2015; Fox, 2016;
Konstantopoulos, 2009).⁸ However, by focusing on heterogeneity by classroom composition, our work is
substantively different in focus. Furthermore, we show that heterogeneity by classroom composition
appears to be significantly larger in magnitude than heterogeneity by a student’s prior achievement.

Third, our study also provides useful complementary evidence to the value-added literature 
which argues fairly persuasively that teachers matter (Jackson et al., 2014; Koedel et al., 2015; 
Rivkin et al., 2005; Chetty et al., 2014; Rothstein, 2010). Consistent with our central hypothesis 
that teacher effectiveness varies with who the teacher teaches, interesting recent work by Stacy et 
al. (2013) shows that value-added estimates are significantly more stable year-to-year for teachers 
of students with higher initial achievement. The most closely related work is an innovative paper
by Jackson (2013), which demonstrates a significant role for match quality between teachers and 
schools. While the econometric issues associated with allowing estimated teacher value-added to 
vary by classroom composition are also of interest, illustrating the potential variation in teacher 
effectiveness using teacher evaluation protocols is a natural starting point, particularly given the 
increasing importance of these protocols for schools. Another key benefit of using the protocols
is to examine teacher effectiveness as a multidimensional construct, which proves central to our
analysis and is less straightforward from a value-added perspective.

⁸ For instance, Lavy (2015) finds larger effects of challenge/student-centered teaching for girls and low-SES students.
Connor et al. (2004) show larger effects of some types of challenge/student-centered practices for children with higher
initial achievement. Finally, Konstantopoulos (2009) finds somewhat larger effects of teacher effectiveness for
high-SES students.


A number of other studies have used the MET data to identify effective teachers. Studies from the
MET project have already generated important insights (Cantrell and Kane, 2013). For instance,
Kane et al. (2013) verify that value-added metrics can be effective ways of evaluating teacher
effectiveness in observational data and that multiple metrics of teacher effectiveness, including
observations of practice, further improve understanding of a teacher’s underlying effectiveness.
Mihaly et al. (2013) show that the different metrics of teacher effectiveness (value-added, classroom 
observation video scores and student survey reports) have important commonalities. Araujo et 
al. (2016) and Bacher-Hicks et al. (2017), in different settings, also illustrate the importance of 
teacher observation protocols for measuring teacher effectiveness. In the present study, we shift 
the emphasis from identifying effective teachers to analyzing whether teachers who display higher 
skills in certain dimensions are more effective in certain types of classrooms. 

Fourth, our paper also contributes to the literature on peer effects. The literature has consid- 
ered fairly extensively how peer effects vary by student background characteristics because of the 
important implications of this type of heterogeneity for tracking and desegregation policies (for
instance, see Burke and Sass, 2006; Fruehwirth, 2013; Gibbons and Telhaj, 2006; Hanushek et al.,
2009; Hanushek and Rivkin, 2009; Hoxby and Weingarth, 2005; Lavy et al., 2012, among others).
Zimmer (2003) and Duflo et al. (2011) consider heterogeneity by student prior achievement and by
whether the school tracks or not, which relates to the present study in interesting ways. A handful
of recent studies suggest that peer effects may be driven by how the teacher adapts or targets her
teaching (Jackson, 2016; Duflo et al., 2011; Lavy et al., 2012; Lee et al., 2014). None of these estimate
complementarities between peer composition and teaching practice. The strong complementarities 
between peer initial achievement and teaching practice also suggest that failure to account for this 
leads us to understate the importance of peers. 

The rest of the paper proceeds as follows. We first describe the data in Section 2, including 
our measures of teaching practice. Section 3 presents our theoretical framework and Section 4 our 
empirical strategy. Section 5 presents our main findings, followed by an analysis of the possible 
mechanisms behind our main results in Section 6. Section 7 performs simulation exercises that
study how reallocation of teachers into classrooms affects their contribution to learning and their
relative rankings. Finally, Section 8 concludes.


2 Data 


The Measures of Effective Teaching (MET) Longitudinal Database provides detailed information 
on teaching practices, student outcomes, and classroom composition from six large urban public 
school districts in the United States over two academic years (2009-2010 and 2010-2011).⁹ The data
are linked to district administrative records, which include detailed student information, most im- 
portantly, current and prior measures of student achievement, but also age, race/ethnicity, gender, 
special education status, free lunch eligibility, gifted status, and English language learner status. 
The data also include rich measures related to teacher aptitude, such as the Content Knowledge for 
Teaching (CKT) assessment, and school principal evaluations.¹⁰ Finally, a key aspect of the MET
data is that teachers were randomly assigned within school and grade to classrooms of students
during the second academic year of the study (2010-2011).¹¹

We analyze students’ math performance because it has traditionally been shown to be more
malleable to school inputs. Moreover, we focus on elementary school students (grades four and five)
given that most of them are taught by general elementary teachers in self-contained classrooms with
more concentrated exposure to the same peers and teachers.¹²


⁹ These districts include New York City Department of Education, Charlotte-Mecklenburg Schools, Denver Public
Schools, Memphis City Schools, Dallas Independent School District, and Hillsborough County Public Schools. Kane
and Staiger (2012) provide a detailed description of how schools were selected to participate in the MET project.
More importantly, Kane and Staiger (2012) argue that MET teachers are comparable by most measures to their
non-MET peers in the district, suggesting that they are representative of the districts included.

¹⁰ The purpose of the CKT math assessment is to measure knowledge tied to the teaching of mathematics, such
as: choosing and using appropriate mathematical representations; choosing examples to illustrate a mathematical
concept; interpreting student work, including use of nonstandard strategies; and evaluating student understanding.

¹¹ When schools joined the MET study in 2009-2010, principals were asked to identify groups of teachers that 1)
were teaching the same subject to students in the same grade, 2) were certified to teach common classes and, 3)
were expected to teach the same subject to students in the same grade the following year. These groups of teachers
were called “exchange groups.” The plan was for principals to create class rosters as similar as possible within an
exchange group, and then send these rosters to MET to be randomly assigned to “exchangeable” teachers. One issue
in practice was that, when it came time to perform the randomization, not all teachers within an exchange group
were able to teach during a common period. As a result, randomization was performed within subsets of exchange
groups called “randomization blocks”.

¹² Appendix A provides a detailed description of the sample selection.


2.1 Measuring Teaching Practice 


We make use of a well-known, research-based classroom observation protocol that measures teaching
practices, the Framework for Teaching (FFT). School districts have increasingly begun to use
these types of protocols for teacher evaluation purposes, and FFT is the most popular (AIR, 2013).
According to MET project (2010b), “FFT has been subjected to several validation studies over the
course of its development and refinement, including an initial validation by Educational Testing
Service (ETS).”¹³ The protocol divides teaching into four domains and the MET database rates
teachers on two of them: classroom environment and instruction. We observe scores for eight
different subdomains of these two domains by a median of seven different highly trained, independent
raters, many of them current or former teachers.¹⁴ These raters had to pass reliability tests in
which their scores were compared with master scores on a number of videos. This provides some
assurance of the quality of these observational data and helps us to address measurement error, as
we discuss further in Section 4.
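
To see why multiple independent raters help with measurement error, the sketch below applies the standard Spearman-Brown formula for the reliability of an average of ratings. It assumes classical measurement error (independent rater errors with equal variance), which is a simplification of the strategy discussed in Section 4, and the single-rater reliability of 0.5 is a hypothetical value, not a MET estimate.

```python
def reliability_of_mean(single_rater_reliability, n_raters):
    """Spearman-Brown reliability of the average of n_raters parallel ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical single-rater reliability of 0.5, with one rater versus seven raters.
for k in (1, 7):
    print(k, round(reliability_of_mean(0.5, k), 3))
```

Under these assumptions, averaging over seven raters, the median number in the data, raises the share of score variance due to the underlying practice well above what any single rating delivers.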

Though FFT was designed so that each subdomain represents a separate aspect of teaching 
practice, we perform an exploratory factor analysis to determine the number of components that 
are actually separable in the data. Appendix Table 7 shows the correlations between the different 
subdomains and the loadings on each subdomain after performing an oblique rotation of the
factors.¹⁵ This analysis suggests that FFT measures can be divided into two separable broad teaching
practices. There are five sub-scales which load heavily on the first factor, including establishing a
culture of learning, communicating with students, engaging students in learning, using assessment
in instruction and using questioning and discussion techniques. These all reflect what we will call
challenge/student-centered practices that encourage classroom dialogue and student involvement.¹⁶
The subdomains that load on the second factor are creating an environment of respect and rapport,
managing student behaviors and managing classroom procedures. We will refer to these as classroom
management practices, as they all relate to teaching practices that lead to a better classroom
environment. Taken together the factors explain 92% of the total variance in the data.¹⁷

As a final robustness check, we also implemented confirmatory factor analysis with the aim of
establishing whether the proposed grouping of the FFT subdomains provides a better fit of the data
than alternative models. First, we compare our model with a competing specification in which
all the FFT subdomains load on only one latent factor. Second, we test our classification against
the grouping that has been predetermined in the FFT protocol (i.e., classroom environment and
instruction domains).¹⁸ In both cases, the Bayesian information criterion (BIC) indicates that our
proposed classification provides a better fit of the data.¹⁹ Our empirical strategy will mainly make
use of averages across the sub-scales that according to the exploratory factor analysis correspond
to each broad practice (i.e., classroom management and challenge/student-centered practices), but
we also explore other ways of addressing measurement error, as described in detail in Section 4.²⁰


2.2 Summary Statistics


Table 1 reports summary statistics for characteristics of the students in our final sample.²¹ This is
a racially-diverse sample; 31% of students are black, 25% are white, 29% are Hispanic, and 11% are
Asian, indicating that the school districts included in our data are not necessarily representative

¹³ Of the MET observation protocols, two, FFT and CLASS, are generic protocols designed to apply across instruction
in a range of subject matters. In our view, of these, FFT has the most comprehensive architecture capturing teaching
practices.

¹⁴ The score assigned to each component ranges between 1 and 4, where each number refers to a level
(1: unsatisfactory, 2: basic, 3: proficient, 4: distinguished). Appendix Table 6 provides a description of each of the
sub-components of the FFT protocol.

¹⁵ The results reported take the average across raters so that there is one observation per component per teacher.
Results are similar if we perform the exploratory factor analysis at the level of the rater or if we use orthogonal
rotations. They are also similar if we extract rater fixed effects and video quality prior to performing the factor
analysis.

¹⁶ We have chosen the term “challenge/student-centered practices” to try to capture the overall emphasis of the
model items. Many of the FFT domains entail elements of student-centered instruction (e.g., in the engaging students
in learning domain, “students identify or create their own materials for learning”). Yet, it is important to note that
the FFT protocol is well balanced with “challenge” items (e.g., the first indicator of proficiency in the questioning
and discussion techniques sub-domain is “questions of high cognitive challenge” (Danielson, 2011)).

¹⁷ An initial exploratory factor analysis shows that there is only one eigenvalue greater than 1, a possible rough
rule of thumb for determining the number of factors. However, one factor explains 0.79 of the variation and a second
factor explains a substantial additional part, 0.13, which is an additional criterion used to determine the number of
factors.

¹⁸ Classroom environment includes: environment of respect and rapport, establishing a culture for learning,
managing student behaviors, and managing classroom procedures. Instruction includes: communicating with
students, engaging students in learning, using assessment in instruction, and using questioning and discussion
techniques.

¹⁹ This analysis has been performed using the “confa” command in Stata, which deals with problems of identification
in factor models (Kolenikov, 2009).

²⁰ We also replicated our empirical strategy using principal components and following the FFT classification as
alternative measures of challenge/student-centered and classroom management practices. Results in all cases are
similar.

²¹ Appendix Table 8 shows summary statistics of the full randomization sample prior to any sample restrictions.


Table 1: Summary Statistics: Sample (N=2632)

                                             Mean      σ       Min      Max
Grade Level                                  4.50     0.50     4.00     5.00
Joint Math and ELA Class                     0.87     0.33     0.00     1.00
Age                                          9.40     0.92     7.02    12.20
Male                                         0.50     0.50     0.00     1.00
Gifted                                       0.05     0.21     0.00     1.00
Special Education                            0.08     0.27     0.00     1.00
English Language Learner                     0.16     0.36     0.00     1.00
White                                        0.25     0.43     0.00     1.00
Black                                        0.31     0.46     0.00     1.00
Hispanic                                     0.29     0.45     0.00     1.00
Asian                                        0.11     0.31     0.00     1.00
American Indian                              0.01     0.08     0.00     1.00
Race Other                                   0.03     0.17     0.00     1.00
Race Missing                                 0.00     0.07     0.00     1.00
Math Score (Year 09-10)                     -0.00     0.90    -2.84     2.73
Math Score (Year 10-11)                      0.04     0.90    -3.26     3.01
Unique Districts                                5      -        -        -
Unique Classes                                147      -        -        -
Unique Schools                                 39      -        -        -
Unique Randomization Blocks                    57      -        -        -
Unique Teachers                               147      -        -        -
Percentage of Class w/ 09-10 Math Scores     0.91     0.07     0.67     1.00
Percentage of Class in Random Assignment     0.78     0.14     0.32     1.00
Teachers per Randomization Block             2.86     0.83     2.00     4.00
Randomization Block Compliance Rate          0.93     0.09     0.50     1.00

Notes: See Appendix A for a description of how this sample was obtained. Joint
Math/ELA Class refers to a self-contained course in which students learn both math
and ELA; the remaining courses are either math or ELA only. We summarize the percentage
of each class w/ prior math test scores since students new to the district will not
have prior test scores. We also summarize the percentage of each class in randomization
because not all students in the classes we observe were on the original randomly
assigned class rosters.


of the whole US population of students. The bottom part of Table 1 further characterizes the data 
by displaying the number of districts (5), schools (39), teachers (147), and randomization blocks 
(57) in our final sample. 

Table 2 displays summary statistics corresponding to the FFT domains and to the average and
inter-quartile range (IQR) of classroom prior achievement.²² The last two columns of
Table 2 show standard deviations within and between randomization blocks. We find considerable
within-randomization block variation in teaching practice and classroom composition.
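
The two classroom-level summaries used here can be sketched in a few lines. The example below is illustrative only: the scores and block labels are made up, the inclusive quartile rule is one of several conventions, and the between/within split shown (standard deviation of block means versus standard deviation of deviations from block means) is an assumption about how such a decomposition is typically computed, not a description of the exact procedure behind Table 2.

```python
from statistics import mean, pstdev, quantiles


def iqr(scores):
    """Gap between the 75th and 25th percentile scores in a class."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1


def between_within_sd(values_by_block):
    """Split variation into block means and deviations from block means."""
    block_means, deviations = [], []
    for values in values_by_block.values():
        mu = mean(values)
        block_means.extend([mu] * len(values))
        deviations.extend(v - mu for v in values)
    return pstdev(block_means), pstdev(deviations)


# Hypothetical class scores: one outlier moves the SD a lot but not the IQR.
typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]
print(iqr(typical), iqr(with_outlier))
print(pstdev(typical), pstdev(with_outlier))

# Hypothetical classroom measures grouped by randomization block.
between, within = between_within_sd({"block_a": [1, 3], "block_b": [5, 7]})
print(between, within)
```

With population standard deviations, the squared between and within components add up to the total variance, which makes the two columns easy to compare.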


3 Model 


We motivate here how interactions between teaching practice and peer initial achievement arise 
through a number of intuitive mechanisms. The simplest model has these interactions arising 
through the production technology. This makes sense for a number of possible teaching practices. 
For instance, encouraging classroom discussion would create more of a team production climate 
where peers matter more for each student’s achievement. Alternatively, for some practices, teaching
practice could enter the achievement production function indirectly through students’ behavioral
responses (e.g., engagement, attentiveness). In this case, complementarities would arise if good
behavior changes whether students benefit from their peers. For instance, classroom management 
practices could help ensure the necessary behavior to create a good learning environment. While the 
production technology channel is straightforward, it is helpful to illustrate the behavioral channel 
with a simple model. The model also informs the empirical specification we take to the data.²³

²² We use IQR (i.e., the difference in test score performance between the 75th and 25th percentile students in a given
class) to measure classroom heterogeneity rather than standard deviation due to the fact that IQR is less sensitive
to the presence of outliers, which is a particular concern in a context where classrooms could be small in size.
Nevertheless, our main specifications presented in columns (1) and (2) of Table 5 are robust to replacing IQR with
the standard deviation.

²³ We take the teaching practice as given in order to focus on student responses. We can identify most convincingly
the effects of a fixed or persistent aspect of teaching practice and postpone considering the endogenous response of
teachers to the classroom composition to future work.


Table 2: Within and Between-Randomization Block Variation in Classroom Measures

                                                                              Std. Dev.  Std. Dev.
                                                Mean  Std. Dev.  Min    Max   Between    Within
Classroom Composition
  Avg Peer Math_{t-1}                           0     1         -2.31   3     0.84       0.58
  IQR Peer Math_{t-1}                           0     1         -2.45   2.92  0.78       0.69
  Avg Peer Math_{t-1} (random)                  0     1         -2.75   3.02  0.84       0.57
  IQR Peer Math_{t-1} (random)                  0     1         -2.34   4.15  0.78       0.7
Teaching Practices
  Challenge/Student-Centered                    0     1         -3.06   2.23  0.74       0.69
  Classroom Management                          0     1         -3.15   2.25  0.74       0.63
FFT Subdomains of Challenge/Student-Centered
  Using questioning and discussion techniques   …     …          …      3.25  0.27       0.25
  Establishing a culture of learning            …     …          …      3.5   0.27       0.21
  Communicating with students                   2.68  0.33       2      3.33  0.24       0.24
  Engaging students in learning                 2.54  0.34       1.67   3.5   0.23       0.26
  Using assessment in instruction               …     …          …      3.5   0.27       0.26
FFT Subdomains of Classroom Management
  Managing student behaviors                    2.81  0.36       1.67   3.5   0.25       0.24
  Managing classroom procedures                 …     …          …      3.5   0.27       0.25
  Creating an environment of respect & rapport  …     …          …      3.5   0.24       0.23

Notes: The sample size is 2632 and focuses on 2010-11 school year when students were randomly assigned
within randomization blocks. Teaching practices are measured in t − 1 based on FFT. The last two
columns decompose the standard deviation for each variable into between randomization block and
within randomization block components.


Let $Y_{it}$ denote achievement of a student $i$ at time $t$. Let the index $c = c(i,t)$ denote $i$’s
classroom in period $t$; the vector of classroom peer achievement excluding $i$ is then denoted
$Y_{-ict} = (Y_{1t}, \ldots, Y_{i-1,t}, Y_{i+1,t}, \ldots, Y_{N_c t})$. A student’s class has a teacher indexed
$j = j(i,t)$ who uses teaching practice(s) $P_j$. We begin with a value-added model where achievement
production is a function of prior achievement and some moment of the prior achievement distribution of
their time $t$ classmates, $m(Y_{-ic,t-1})$. We introduce student behavior, $b_{it}$, which we conceptualize
broadly as behaviors conducive to achievement, such as attentiveness, engagement and/or effort. Achievement
production includes direct interactions between teaching practice and classroom composition and
the possibility of an indirect channel by allowing the marginal benefits of behavior to vary by the
classroom composition, i.e.,


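\begin{equation}
\begin{aligned}
Y_{it} = {} & \beta_0 + \beta_b b_{it} + \beta_{by} b_{it} Y_{it-1} + \beta_{bg} b_{it}\, m(Y_{-ic,t-1}) + \beta_y Y_{it-1} + \beta_g\, m(Y_{-ic,t-1}) \\
& + \beta_p P_j + \beta_{py} P_j Y_{it-1} + \beta_{pg} P_j\, m(Y_{-ic,t-1}) + \epsilon_{it},
\end{aligned}
\tag{1}
\end{equation}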


where $\epsilon_{it}$ denotes the residual. Note that this form of the achievement production function is
comparable to the classic achievement production function with peer spillovers that is generally
the focus of the literature when $m(Y_{-ic,t-1})$ is equal to average peer prior achievement (Sacerdote,
2011), though it is augmented with controls for measures of teaching practice and teaching practice 
interacted with peers. This is understood to capture the reduced-form effect of peers, inclusive 
of both endogenous (effects arising through simultaneity in achievement) and contextual effects 
(arising through direct spillovers from peer prior achievement). Other models in Appendix B 
consider potential teacher effects by changing the behavior of both the student and her peers, but 
we begin here as it is the most straightforward model to connect with the literature. 

Students choose their behavior to maximize their expected utility from achievement net of the
costs of behavior. To introduce a role for teaching practice in affecting behavior, we also allow
the marginal utility/cost of behavior to vary with the practice, i.e.,

$$U_{it} = \gamma_y Y_{it} - \frac{\gamma_b}{2} b_{it}^2 + \gamma_{bp} P_j b_{it}.$$

Student utility-maximizing behavior $b_{it}^*$ is then

$$b_{it}^* = \frac{1}{\gamma_b}\Big(\gamma_y\big(\beta_b + \beta_{by} Y_{it-1} + \beta_{bg}\, m(Y_{-ic,t-1})\big) + \gamma_{bp} P_j\Big).$$

Behavior is increasing in initial achievement, peer initial achievement and, importantly, teaching
practice. Classroom management practices may affect behavior directly through minimizing
opportunities for disruptive behavior, whereas challenge/student-centered practices might do so by
better engaging students in learning.

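We cannot estimate (1) directly because we do not observe behavior. Instead, we assume that
the achievement we observe in the data results from students' optimizing behavior. To obtain
the achievement production function we can take to the data, we plug in utility-maximizing behavior
to obtain the following reduced form

$$
\begin{aligned}
Y_{it}^* = {} & \beta_0 + \beta_b^2\frac{\gamma_y}{\gamma_b} + \Big(\beta_b\frac{\gamma_{bp}}{\gamma_b} + \beta_p\Big) P_j + \Big(\beta_{bg}\frac{\gamma_{bp}}{\gamma_b} + \beta_{pg}\Big) P_j\, m(Y_{-ic,t-1}) + \Big(2\beta_b\beta_{bg}\frac{\gamma_y}{\gamma_b} + \beta_g\Big) m(Y_{-ic,t-1}) \\
& + \beta_{bg}^2\frac{\gamma_y}{\gamma_b}\, m(Y_{-ic,t-1})^2 + \Big(\beta_y + 2\beta_b\beta_{by}\frac{\gamma_y}{\gamma_b}\Big) Y_{it-1} + \beta_{by}^2\frac{\gamma_y}{\gamma_b}\, Y_{it-1}^2 + \Big(\beta_{py} + \beta_{by}\frac{\gamma_{bp}}{\gamma_b}\Big) P_j Y_{it-1} \\
& + 2\beta_{by}\beta_{bg}\frac{\gamma_y}{\gamma_b}\, Y_{it-1}\, m(Y_{-ic,t-1}) + \varepsilon_{it} \\
= {} & \alpha_0 + \alpha_p P_j + \alpha_{pg} P_j\, m(Y_{-ic,t-1}) + \alpha_g\, m(Y_{-ic,t-1}) + \alpha_{g2}\, m(Y_{-ic,t-1})^2 + \alpha_y Y_{it-1} + \alpha_{y2} Y_{it-1}^2 \\
& + \alpha_{py} P_j Y_{it-1} + \alpha_{yg} Y_{it-1}\, m(Y_{-ic,t-1}) + \varepsilon_{it}. \qquad (2)
\end{aligned}
$$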


Note that even if $\beta_p = \beta_{py} = \beta_{pg} = 0$, so that teaching practice does not affect achievement
directly and, more importantly, does not have direct complementarities with peer achievement, this
specification illustrates how we would also get complementarities from the indirect behavioral
channel. This relies on two intuitive conditions. First, student behavior is affected by practice
($\gamma_{bp} \neq 0$). Second, the achievement spillovers from peers vary with behavior ($\beta_{bg} \neq 0$). Note
that in this model, the spillovers from peer prior achievement arise both directly through the
production technology and indirectly through peer effects on the unobserved behavior of the student
coming from the marginal return of behavior varying with peer prior achievement. In Appendix
B, we discuss some alternative forms of the behavioral model which could also underlie these
complementarities, including popular conformity-style models (Brock and Durlauf, 2001; Epple
and Romano, 2010) or the classic treatment of the classroom environment as a congestible public
good (Lazear, 2001).
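
The mapping from the structural model to the reduced form can be checked numerically. The sketch below assumes quadratic behavior costs, $U_{it} = \gamma_y Y_{it} - (\gamma_b/2) b_{it}^2 + \gamma_{bp} P_j b_{it}$, which is our reading of the utility specification; all parameter values and function names are hypothetical and chosen only to exercise the algebra. It sets the direct practice effects to zero and confirms that a practice-by-peer interaction still arises through behavior.

```python
# Hypothetical parameter values (not estimates from the paper).
# Direct practice effects beta_p, beta_py, beta_pg are set to zero on purpose.
beta = dict(b0=0.1, b=0.5, by=0.2, bg=0.3, y=0.7, g=0.05, p=0.0, py=0.0, pg=0.0)
gamma = dict(y=1.0, b=2.0, bp=0.4)


def optimal_behavior(P, y_lag, m):
    """Behavior solving the first-order condition of the assumed quadratic utility."""
    marginal_return = beta["b"] + beta["by"] * y_lag + beta["bg"] * m
    return (gamma["y"] * marginal_return + gamma["bp"] * P) / gamma["b"]


def structural(P, y_lag, m):
    """Equation (1) evaluated at optimal behavior, with the residual set to zero."""
    b = optimal_behavior(P, y_lag, m)
    return (beta["b0"] + beta["b"] * b + beta["by"] * b * y_lag + beta["bg"] * b * m
            + beta["y"] * y_lag + beta["g"] * m
            + beta["p"] * P + beta["py"] * P * y_lag + beta["pg"] * P * m)


def reduced_form_coefficients():
    """Reduced-form (alpha) coefficients implied by the structural parameters."""
    r_y = gamma["y"] / gamma["b"]    # gamma_y / gamma_b
    r_p = gamma["bp"] / gamma["b"]   # gamma_bp / gamma_b
    return {
        "a0": beta["b0"] + beta["b"] ** 2 * r_y,
        "ap": beta["p"] + beta["b"] * r_p,
        "apg": beta["pg"] + beta["bg"] * r_p,
        "ag": beta["g"] + 2 * beta["b"] * beta["bg"] * r_y,
        "ag2": beta["bg"] ** 2 * r_y,
        "ay": beta["y"] + 2 * beta["b"] * beta["by"] * r_y,
        "ay2": beta["by"] ** 2 * r_y,
        "apy": beta["py"] + beta["by"] * r_p,
        "ayg": 2 * beta["by"] * beta["bg"] * r_y,
    }


def reduced_form(P, y_lag, m):
    """Equation (2) evaluated with the implied alpha coefficients."""
    a = reduced_form_coefficients()
    return (a["a0"] + a["ap"] * P + a["apg"] * P * m + a["ag"] * m + a["ag2"] * m ** 2
            + a["ay"] * y_lag + a["ay2"] * y_lag ** 2 + a["apy"] * P * y_lag
            + a["ayg"] * y_lag * m)


if __name__ == "__main__":
    # The interaction coefficient is nonzero even though beta_pg = 0.
    print("alpha_pg =", reduced_form_coefficients()["apg"])
    print(structural(1.0, -0.5, 0.3), reduced_form(1.0, -0.5, 0.3))
```

Under these assumptions the interaction coefficient equals $\beta_{bg}\gamma_{bp}/\gamma_b$, so it vanishes only if practice does not move behavior or if peer spillovers do not vary with behavior.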


4 Estimation 


Our empirical strategy focuses on estimation of the reduced form model described in equation (2),
which relates most closely to models estimated in the literature. We take as a starting point that
$m(Y_{-ic,t-1}) = \bar{Y}_{-ic,t-1}$ and expand to include the IQR of the peer initial achievement distribution
in the application, i.e.,

$$
\begin{aligned}
Y_{it} = {} & \alpha_0 + \alpha_p P_j + \alpha_{pg} P_j \bar{Y}_{-ic,t-1} + \alpha_g \bar{Y}_{-ic,t-1} + \alpha_{g2} \bar{Y}_{-ic,t-1}^2 + \alpha_y Y_{it-1} + \alpha_{y2} Y_{it-1}^2 \\
& + \alpha_{py} P_j Y_{it-1} + \alpha_{yg} Y_{it-1} \bar{Y}_{-ic,t-1} + \varepsilon_{it}, \qquad (3)
\end{aligned}
$$

where we assume that observed achievement is a result of students' utility-maximizing behaviors.
Our main parameter of interest is $\alpha_{pg}$, which captures how the marginal benefits of teaching
practices vary with the classroom composition.24 Without loss of generality we set
$E(P_j) = E(\bar{Y}_{-ic,t-1}) = E(Y_{it-1}) = 0$. Demeaning these variables aids in interpretation of the level terms
$\alpha_p$, $\alpha_g$ and $\alpha_y$ in equation (4) by making them invariant to adding the interaction terms to the
equation while leaving the interactions unchanged.
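
To see why demeaning affects only the level terms, consider a stripped-down version of the model with one practice $P$, one composition measure $G$, and a single interaction. The symbols $\mu_P$, $\mu_G$, $\tilde{P}$, $\tilde{G}$ and the coefficients $a_0, a_p, a_g, a_{pg}$ are introduced here purely for illustration and are not the paper's notation.

```latex
% Raw variables:
Y = a_0 + a_p P + a_g G + a_{pg} P G + e .
% Write P = \tilde{P} + \mu_P and G = \tilde{G} + \mu_G with E(\tilde{P}) = E(\tilde{G}) = 0, so that
% P G = \tilde{P}\tilde{G} + \mu_G \tilde{P} + \mu_P \tilde{G} + \mu_P \mu_G . Substituting:
Y = \underbrace{(a_0 + a_p \mu_P + a_g \mu_G + a_{pg}\mu_P\mu_G)}_{\text{new intercept}}
  + \underbrace{(a_p + a_{pg}\mu_G)}_{\text{level effect of } \tilde{P}} \tilde{P}
  + \underbrace{(a_g + a_{pg}\mu_P)}_{\text{level effect of } \tilde{G}} \tilde{G}
  + a_{pg}\, \tilde{P}\tilde{G} + e .
```

The interaction coefficient $a_{pg}$ is the same in both parameterizations, while the coefficient on $\tilde{P}$ becomes the marginal effect of practice evaluated at average composition, which is the quantity a specification without interactions would approximate.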

As discussed above, a unique aspect of these data is that teachers are randomly assigned to
classrooms within randomization blocks. However, even with random assignment of teachers to
classrooms, several important endogeneity concerns remain. First, there is considerable non-compliance
with the random assignment in the data. Largely, this was because assignments were made from
preliminary rosters before school administrators had a good sense of who would be attending their
school. Second, classroom composition may be endogenous as principals were not required to
randomly assign students to classrooms. Third, teaching practice may still be endogenous even with
random assignment because of measurement error. We discuss each of these issues in turn.

24 To simplify exposition, we ignore the role of other student and teacher observables though we include these
additional controls in the analysis.
4.1 Non-compliance

Because the data include an indicator of the teacher that was randomly assigned to the
student, we can use standard approaches for dealing with non-compliance, focusing on the variation
from the randomly assigned teacher. We focus most of our discussion around the more conservative
"intent-to-treat" estimates, which replace the observed teaching practice with the
randomly-assigned teaching practice. Let $P_r$ denote the teaching practice of the randomly-assigned teacher,
indexed $r = r(i,t)$, then

$$
\begin{aligned}
Y_{it} = {} & \alpha_0 + \alpha_p P_r + \alpha_{pg} P_r \bar{Y}_{-ic,t-1} + \alpha_g \bar{Y}_{-ic,t-1} + \alpha_{g2} \bar{Y}_{-ic,t-1}^2 + \alpha_y Y_{it-1} + \alpha_{y2} Y_{it-1}^2 \\
& + \alpha_{py} P_r Y_{it-1} + \alpha_{yg} Y_{it-1} \bar{Y}_{-ic,t-1} + \alpha_b + \varepsilon_{it}. \qquad (4)
\end{aligned}
$$

Because teachers are randomly assigned at the randomization block level, we include randomization
block fixed effects $\alpha_b$, where $b = b(i,t)$ indexes randomization blocks. We show that our results
are very similar when we instrument the observed teaching practice with that of the randomly-assigned
teacher, and so choose to focus on the intent-to-treat estimates for simplicity.
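
The relationship between the intent-to-treat and instrumented estimates can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions that are ours, not the paper's: a single practice measure with no interactions, no fixed effects, non-compliers who end up with an unrelated teacher, and hypothetical values for the compliance rate and the true effect.

```python
import random


def cov(x, y):
    """Covariance of two equal-length lists (population normalization)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def simulate_itt_and_iv(n=20000, compliance=0.7, true_effect=0.5, seed=0):
    """Simulate random assignment with non-compliance and return ITT and IV slopes."""
    rng = random.Random(seed)
    assigned, actual, outcome = [], [], []
    for _ in range(n):
        p_assigned = rng.gauss(0.0, 1.0)           # practice of the assigned teacher
        if rng.random() < compliance:
            p_actual = p_assigned                  # student stays with assigned teacher
        else:
            p_actual = rng.gauss(0.0, 1.0)         # student ends up with another teacher
        y = true_effect * p_actual + rng.gauss(0.0, 1.0)
        assigned.append(p_assigned)
        actual.append(p_actual)
        outcome.append(y)
    itt = cov(outcome, assigned) / cov(assigned, assigned)         # reduced form
    first_stage = cov(actual, assigned) / cov(assigned, assigned)  # compliance slope
    return {"itt": itt, "first_stage": first_stage, "iv": itt / first_stage}


if __name__ == "__main__":
    print(simulate_itt_and_iv())
```

In this stylized setting the ITT slope is the true effect scaled down by the first-stage slope, which is why the intent-to-treat estimates are the more conservative of the two.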


4.2 Endogeneity of classroom composition 


Classroom composition could be endogenous for two reasons. First, the principals were not
required to assign classroom composition randomly, though there was an incentive to create comparable
classrooms within randomization blocks to make the random assignment of teachers to either
classroom palatable. Second, non-compliance by students could lead the classroom composition to be
endogenous even after addressing non-compliance at the teacher level.

The question is then whether we can identify $\alpha_{pg}$ even though $\bar{Y}_{-ic,t-1}$ is potentially endogenous.
Nizalova and Murtazashvili (2014) show that, in randomized controlled trials, interactions
of the random treatment with endogenous characteristics are indeed exogenous. Bun and Harrison
(2018) expand this and provide weaker conditions for identification.25 The key assumption is that
$(\bar{Y}_{-ic,t-1}, \varepsilon_{it})$ are jointly independent of $P_r$, conditional on other controls. This means that
matching of students to peers based on unobservables does not vary with teaching practice. Thus, the
main concern is about potential re-sorting of students after teachers are randomly assigned. We do
not believe this poses a threat to identification for several reasons, which we discuss in Section 4.4.

25 See also Annan and Schlenker (2015) and Di Falco et al. (2018) for other examples of applications of this argument.


4.3 Measurement error and endogeneity of teaching practice

Recall that we have multiple observations of teaching practice taken from video observations from
multiple raters of the teacher both in the initial observational year and in the random assignment
year to help deal with potential measurement error in teaching practice. As in Araujo et al. (2016),
our preferred approach is to use $t-1$ measures to capture the teaching practice. This addresses
two related concerns. First, video raters may have difficulty separating the teacher's practice from
the students they are teaching. Second, if teachers change their practice in response to classroom
composition, then teaching practice would no longer be exogenous, violating our key identifying
assumptions.

Our main strategy relies on the most straightforward approach to measurement by taking simple
averages of the measures of practice ($\bar{P}_{r,t-1}$). To clarify the potential effects of measurement error
on our estimates, let the subscript $k$ capture different observations of the teaching practice, i.e.,

$$P_{rk,t-1} = P_r + u_{rk,t-1}. \qquad (5)$$

Substituting in the average measured practice for the true measures, we have

$$
\begin{aligned}
Y_{it} = {} & \alpha_0 + \alpha_p \bar{P}_{r,t-1} + \alpha_{pg} \bar{P}_{r,t-1} \bar{Y}_{-ic,t-1} + \alpha_g \bar{Y}_{-ic,t-1} + \alpha_{g2} \bar{Y}_{-ic,t-1}^2 + \alpha_y Y_{it-1} + \alpha_{y2} Y_{it-1}^2 \\
& + \alpha_{py} \bar{P}_{r,t-1} Y_{it-1} + \alpha_{yg} Y_{it-1} \bar{Y}_{-ic,t-1} + \alpha_b + \nu_{it},
\end{aligned}
$$

where $\nu_{it} = \varepsilon_{it} - \alpha_p \bar{u}_{r,t-1} - \alpha_{pg} \bar{u}_{r,t-1} \bar{Y}_{-ic,t-1} - \alpha_{py} \bar{u}_{r,t-1} Y_{it-1}$. Note that as the number of
observations of practice increases, $\bar{u}_{r,t-1}$ goes toward 0 if $u_{rk,t-1}$ is mean independent of $u_{rk',t-1}$
for $k \neq k'$. This is reasonable in our setting given the use of multiple trained raters to rate the same
teacher, leading to arguably independent random draws of rater-related measurement error.26
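
The benefit of averaging several ratings can be quantified with a short calculation. The sketch below assumes classical measurement error that is independent across ratings with a common variance; the function names and the variance values are hypothetical, and the reliability ratio shown applies to a linear level term rather than to the interaction terms.

```python
def averaged_noise_variance(var_noise, n_ratings):
    """Variance of the mean of n independent rating errors with common variance."""
    return var_noise / n_ratings


def reliability(var_practice, var_noise, n_ratings):
    """Share of the variance of the averaged rating that reflects true practice."""
    return var_practice / (var_practice + averaged_noise_variance(var_noise, n_ratings))


if __name__ == "__main__":
    # Hypothetical case: true practice and a single rating's error have equal variance.
    for k in (1, 2, 4, 8):
        print(k, averaged_noise_variance(1.0, k), round(reliability(1.0, 1.0, k), 3))
```

With equal signal and noise variance, a single rating would have a reliability of one half, while averaging four independent ratings raises it to 0.8; the remaining attenuation shrinks further as more ratings are added.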


26 In earlier versions, we also tried controlling for rater fixed effects in measures of practice to account for any
systematic rater differences and again results were very similar.

We show results are robust to using principal component analysis to construct our measures (the
primary approach we have seen applied in this literature) or factor models to extract the underlying
teaching practice from multiple measures as in equation (5). We are also aware of the concern that
simply including extracted factors in nonlinear models does not completely deal with measurement
error. We adapt the method developed in Hausman et al. (1991) to deal with nonlinear errors in
variables models to our setting where the nonlinearity takes the form of interactions. We describe
this approach in detail in Appendix C.2. If anything, these results imply that our estimates of the
interactions are biased toward 0, which is typical of these types of models in the literature (Jaccard
and Wan, 1995; Busemeyer and Jones, 1983).

To the extent that practice is time-varying, the focus on $t-1$ measures may understate the total
effect of teaching practice. For time-varying practice, we can extract instead the common component
from the correlation between time $t-1$ and $t$ practices, which captures a persistent aspect
of teaching practice. We discuss in Section 5.3 the findings when we instrument contemporaneous
teaching practice with $t-1$ practices. These results show that, if anything, our estimation strategy
provides conservative estimates of the interaction of practice with classroom composition.

4.4 Testing identifying assumption

We perform a number of tests to ensure that our key identifying assumption holds. First, we test
whether the practice is mean independent of peer characteristics directly by separately regressing
randomly assigned teaching practice (based on $t-1$ averages) on different variables of classroom
composition after controlling for randomization block fixed effects.27 Appendix Table 9 presents
these balancing tests, which show that teaching practice is not correlated with either of our measures
of classroom composition, whether we use observed peers or initially-assigned peers. Second,
regressions of the randomly-assigned teaching practice on student-level covariates also suggest that
random assignment of teachers held. Third, though random assignment to classroom is not needed
for identification, Appendix Table 9 also presents balancing tests which regress student characteristics
on peer characteristics to see if there is evidence of matching in the data. Again, the balancing
test generally supports that there is no matching of students (either using the observed or
initially-assigned peers), suggesting that classroom composition does not appear to be endogenous, at least
in terms of observables.28 The fact that the results of our balancing tests for peer characteristics and
teachers are similar in the initially-assigned and observed classroom compositions helps to alleviate
concerns about students re-sorting after teachers were randomly assigned. We can test further the
implications for our estimation if there is some matching based on unobservables that we did not
detect with our tests, by replacing the observed peer characteristics with the initially-assigned peer
characteristics in our regressions. We show that results are robust to this setting in Section 5.3,
alleviating any remaining concerns about non-random sorting of students in
response to randomly-assigned teachers.

27 A similar approach is implemented in Kane et al. (2013).
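
When reading a battery of balancing tests, it helps to know how many rejections would arise by chance alone. The sketch below uses the counts reported for the student-peer balancing tests (3 significant coefficients out of 22 at the 0.1 level) and assumes, purely for illustration, that the tests are independent, which need not hold when covariates are correlated. The function names are hypothetical.

```python
from math import comb


def expected_rejections(n_tests, level):
    """Expected number of false rejections when every null hypothesis is true."""
    return n_tests * level


def prob_at_least(k, n_tests, level):
    """Binomial probability of k or more rejections among independent tests."""
    return sum(
        comb(n_tests, j) * level ** j * (1 - level) ** (n_tests - j)
        for j in range(k, n_tests + 1)
    )


if __name__ == "__main__":
    print(expected_rejections(22, 0.1))   # expected count under the null
    print(prob_at_least(3, 22, 0.1))      # chance of seeing 3 or more rejections
```

Under the independence assumption, about 2.2 rejections are expected and three or more occur in well over a third of such batteries, so the observed count gives little evidence of systematic matching.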


5 Results 


To ground our analysis more closely in the literature, we begin with the typical specifications that
treat teachers and peers as separable inputs. We then add interactions with classroom composition
to show how the significance of measured teaching practices changes across these specifications. All
estimates include controls for randomization block fixed effects, student characteristics and teacher
aptitude as measured by the Content Knowledge for Teaching (CKT) assessment, though results are
robust to their exclusion.29 Given the endogeneity concerns described in Section 4, we focus the initial
analysis on lagged measures of teaching practice, and consider contemporaneous measures in Section 5.3.


5.1 Do Teaching Practices have a Direct Effect on Test Scores? 


Panels A and B of Table 3 display estimates of the effect of classroom management and
challenge/student-centered practices, respectively, on math performance. Even columns allow the effect of
teaching practice to vary by a student's initial achievement. Results in columns (1) and (2) are
naive OLS specifications, where the lagged teaching practice of the current teacher ($P_{j,t-1}$) is the
variable of interest. Columns (3) and (4) report intent-to-treat (ITT) estimates, replacing $P_{j,t-1}$
with the teaching practice of the randomly-assigned teacher ($P_{r,t-1}$). Columns (5) and (6) present
treatment on the treated (TT) results where $P_{j,t-1}$ is instrumented with $P_{r,t-1}$.

28 We find 3 out of 22 coefficients to be statistically significantly different from 0 at the 0.1 level, which is in
line with what would be expected by chance (about 2 of 22 at this level).

29 See MET project (2010a) and footnote 10 for a description of this teacher assessment. The controls help with
standard errors but do not matter for consistency because of the random assignment of treatment.

Given the breadth of the measures, it is perhaps surprising that none of the specifications (in
both panels) shows that the level of teaching practices plays a statistically significant role in math
performance.30 However, these results are consistent with the findings in Garrett and Steinberg
(2015), where the average of all FFT measures does not seem to have a direct impact on students'
performance in their ITT and IV specifications.31 In a similar vein, while interactions of student prior
achievement with classroom management or challenge/student-centered practices are statistically
significant in ITT and IV specifications, F-tests (reported at the bottom of each panel) show that
the coefficients associated with these practices are in many specifications not jointly significant.
At first glance, these findings suggest that our constructs of teaching practice may not capture an
aspect of teaching practice that is meaningful for math performance. However, the next section
shows that these conclusions are misleading when we build in complementarities between teaching
practice and peers.


5.2 Teaching Practice and Classroom Composition 


We expand the previous analysis by fully estimating equation (3), including interactions between
classroom composition and teaching practice. Panels A and B of Table 4 present results for
classroom management and challenge/student-centered practices, respectively. Odd columns
accommodate models where average peer prior achievement is interacted with teaching practice (in addition
to student prior achievement), while even columns additionally control for classroom interquartile
range and its interaction with teaching practice.32 Columns (1) and (2) report ITT results (i.e.,
$P_{j,t-1}$ is replaced with $P_{r,t-1}$ as per equation (4)). Columns (3) and (4) report TT estimates where
$P_{j,t-1}$ is instrumented with $P_{r,t-1}$.

30 These results also hold if, instead of using averages of the sub-domains, we consider a principal component
approach or the Hausman et al. (1991) econometric strategy described in Appendix C.2.

31 We also performed ITT specifications where the average of all FFT measures is included as a regressor (instead of
the subdomains). Our results also show that there is no direct impact of average FFT on student achievement.

32 See footnote 22 for an explanation of why we include IQR in our specifications rather than standard deviation.
Table 3: Effects of Teaching Practice without Classroom Interactions

                                      Actual Teacher        Random Teacher        IV Actual with Rand. Teacher
                                      (1)        (2)        (3)        (4)        (5)        (6)
Panel A
Classroom Management                  0.005      0.005      0.009      0.009      0.010      0.009
                                      (0.025)    (0.024)    (0.022)    (0.021)    (0.025)    (0.024)
C.M. x Math_{t-1}                                0.018                 0.022*                0.023*
                                                 (0.013)               (0.013)               (0.013)
Math_{t-1}                            [illegible]
                                      (0.018)    (0.018)    (0.018)    (0.018)    (0.018)    (0.018)
Avg Peer Math_{t-1}                   0.013      0.014      0.013      0.014      0.014      0.015
                                      (0.026)    (0.026)    (0.026)    (0.026)    (0.025)    (0.025)
P-value (joint signif. of                        0.380                 0.239                 0.219
teaching practice)
F-Stat. (first stage)†                                                            251.3      167.1

Panel B
Challenge/Student-Centered            0.021      0.019      0.025      0.023      0.029      0.026
                                      (0.022)    (0.022)    (0.021)    (0.021)    (0.024)    (0.024)
C.S.C. x Math_{t-1}                              0.017                 0.025*                0.026*
                                                 (0.013)               (0.013)               (0.014)
Math_{t-1}                            0.737***   0.737***   0.737***   0.737***   0.738***   0.738***
                                      (0.018)    (0.018)    (0.018)    (0.018)    (0.018)    (0.018)
Avg Peer Math_{t-1}                   0.015      0.013      0.014      0.011      0.016      0.014
                                      (0.026)    (0.026)    (0.026)    (0.025)    (0.026)    (0.025)
P-value (joint signif. of                        0.195                 0.042                 0.029
teaching practice)
F-Stat. (first stage)†                                                            279.8      186.6

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are
clustered at the randomization block level. Panels A and B correspond to different regressions with math as
the dependent variable. Lagged teaching practices are used and sample size is 2632. These regressions include
randomization block fixed effects and controls for the level and a squared term of prior math achievement
and average peer prior achievement, as well as CKT and student characteristics listed in Table 1. † Reports
the Kleibergen-Paap rk Wald statistic for a weak instrument test.
Panel A shows that classrooms benefit more from higher average peer initial achievement when
the teacher uses good classroom management practices, which is consistent with the mechanisms
discussed in our model. For example, ITT and TT results show that a one standard deviation
increase in classroom management increases test scores by around 7.4% to 8.9% of a standard deviation
when peer average prior-year performance is one standard deviation above the mean. In contrast,
the even columns show that the effectiveness of classroom management practices does not vary
significantly with the IQR in classroom prior achievement. On the one hand, these results have the
intuitive interpretation that a student cannot benefit from higher-achieving peers if the teacher does
not have good classroom management practices, which would foster positive classroom behaviors.
On the other hand, it could be expected that classroom management practices are more effective
among low-achieving students. Instead, our finding is consistent with the understanding that
classroom management is also an important challenge in higher-achieving classrooms, though the
sources of disengagement may be different from those in lower-achieving classrooms (Shernoff et al., 2003).

Furthermore, consistent with the results in Table 3, the level effects of classroom management
practices are still not statistically significantly different from 0 and point estimates are small.
Moreover, the interactions between classroom management and student's prior achievement become
statistically insignificant in most specifications, suggesting that failure to account for
complementarities with classroom composition may lead to stronger conclusions about student-level heterogeneity
in the effects of teaching practice. A further notable change is that classroom management emerges
as a jointly statistically significant predictor of test performance when interacted with average peer
prior achievement at the 99% confidence level in most specifications.

Panel B shows results for challenge/student-centered practices. Generally, we find that classes
with higher average initial achievement also benefit more from challenge/student-centered practices.
However, the benefits of challenge/student-centered practices are smaller in classrooms with higher
IQR in initial achievement. A standard deviation increase in this practice leads to a reduction in
achievement of 5 to 6% of a standard deviation for classrooms that are a standard deviation above
average IQR. As in the case of classroom management, the level effect of challenge/student-centered
practices is not statistically significantly different from 0 and neither are the interactions with initial
achievement, after controlling for interactions with classroom composition. Furthermore, joint tests also
confirm that challenge/student-centered practices emerge as statistically significant predictors of
achievement at the 99% confidence level after permitting heterogeneity by classroom composition.
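
The estimates can be combined into the predicted effect of a one standard deviation increase in a practice for a given classroom. The sketch below plugs in the point estimates from column (2) of Table 4 (ITT with IQR controls) and assumes that practice, own prior achievement, peer mean and peer IQR are all measured in standard deviations from their means. The function and dictionary names are hypothetical, and the calculation ignores sampling uncertainty, including the fact that the level and own-achievement terms are not statistically significant.

```python
# Point estimates read from Table 4, column (2); keys are illustrative names.
TABLE4_COL2 = {
    "classroom_management": {
        "level": 0.008, "own_prior": 0.011, "peer_mean": 0.074, "peer_iqr": -0.017,
    },
    "challenge_student_centered": {
        "level": 0.018, "own_prior": 0.012, "peer_mean": 0.031, "peer_iqr": -0.053,
    },
}


def marginal_effect(practice, own_prior=0.0, peer_mean=0.0, peer_iqr=0.0):
    """Predicted change in math score (in SD) from a one SD increase in a practice.

    Arguments are standardized deviations from the sample mean (an assumption).
    """
    c = TABLE4_COL2[practice]
    return (c["level"] + c["own_prior"] * own_prior
            + c["peer_mean"] * peer_mean + c["peer_iqr"] * peer_iqr)


if __name__ == "__main__":
    # Classroom with peer mean one SD above average, everything else at the mean.
    print(marginal_effect("classroom_management", peer_mean=1.0))
    # Classroom with peer IQR one SD above average, everything else at the mean.
    print(marginal_effect("challenge_student_centered", peer_iqr=1.0))
```

Read this way, the same challenge/student-centered practice has a positive predicted effect in a homogeneous, high-achieving classroom and a negative one in a classroom with widely dispersed prior achievement.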

In summary, the findings in Table 4 provide four main messages.33 First, teaching practices
seem to show significant complementarities with classroom characteristics, ranging in magnitude
from 3% to 8.9% of a standard deviation increase in math, for a standard deviation increase in
teaching practice in a class that is one standard deviation above the mean in prior performance.
We view these estimates as sizable given that some of the larger estimates of the effect of a standard
deviation increase in teacher value-added on math scores range from 0.11 to 0.16 (Chetty et al., 2014). A
standard finding in the literature is that the first two years of teacher experience, where experience
effects are largest, increase student performance by only 0.06 of a standard deviation (Ladd and
Thompson, 2008).

Second, failure to account for complementarities with classroom composition leads us to
understate the importance of these teaching practices. Third, student-level heterogeneity in
effects of teaching practice appears less relevant after accounting for the complementarities with
classroom composition. Finally, the contrasting evidence between classroom management and
challenge/student-centered practices also points to the importance of considering these measures
separately, i.e., a single measure of teaching quality, the focus in the literature, does not fit the
findings when we allow for classroom context to moderate effects. We return to explore this in
more detail in Section 6.


33 We also performed similar regressions on a broader sample to test to what extent our sample restrictions described
in Appendix C.1 are affecting our results. If instead we restrict the data to a) fourth and fifth grade students who
were randomly assigned a teacher, b) cases where both a student's actual and randomly assigned teacher have
non-missing year 1 teaching practice measures, c) students with non-missing test scores in both years, and
d) randomization blocks with at least a 25% compliance rate, we obtain a sample of 4086 students. This larger
sample provides very similar results to those reported in columns (3) and (4) of Table 4.

Table 4: Teaching Practice and Classroom Composition

                                      Random Teacher        IV Actual with        Random Teacher
                                                            Rand. Teacher         and Class
                                      (1)        (2)        (3)        (4)        (5)        (6)
Panel A
Classroom Management                  0.005      0.008      0.002      0.006      0.003      0.005
                                      (0.018)    (0.019)    (0.021)    (0.021)    (0.019)    (0.020)
C.M. x Math_{t-1}                     0.011      0.011      0.011      0.011      0.015      0.014
                                      (0.013)    (0.012)    (0.013)    (0.012)    (0.013)    (0.012)
C.M. x Avg. Peer Math_{t-1}           0.079***   0.074***   0.089***   0.084***   0.056***   0.051**
                                      (0.021)    (0.025)    (0.023)    (0.030)    (0.020)    (0.022)
C.M. x IQR Peer Math_{t-1}                       -0.017                -0.014                -0.017
                                                 (0.019)               (0.023)               (0.018)
P-value (joint signif. of             [illeg.]   0.000      0.000      0.000      0.018      0.007
teaching practice)
First Stage F-Stat.†                                        84.4       42.9

Panel B
Challenge/Student-Centered            0.018      0.018      0.017      0.015      0.017      0.014
                                      (0.022)    (0.020)    (0.026)    (0.023)    (0.023)    (0.022)
C.S.C. x Math_{t-1}                   0.016      0.012      0.017      0.013      0.022*     0.020
                                      (0.012)    (0.012)    (0.013)    (0.013)    (0.013)    (0.012)
C.S.C. x Avg. Peer Math_{t-1}         0.044***   0.031**    0.050***   0.037**    0.035**    0.039**
                                      (0.016)    (0.014)    (0.019)    (0.017)    (0.016)    (0.015)
C.S.C. x IQR Peer Math_{t-1}                     -0.053***             -0.058***             -0.037***
                                                 (0.014)               (0.014)               (0.013)
P-value (joint signif. of             [illeg.]   [illeg.]   0.002      0.000      0.005      0.000
teaching practice)
First Stage F-Stat.†                                        67.1       53.4

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered
at the randomization block level. Sample size is 2632. Lagged teaching practices are used throughout;
columns (5) and (6) control for characteristics of initially randomly assigned peers. Panels A and B correspond
to different regressions with math as the dependent variable. These regressions include randomization block
fixed effects and controls for the level and a squared term of prior math achievement and average peer prior
achievement, as well as CKT and student characteristics listed in Table 1. Even columns also include the
IQR in peer prior achievement. Whenever peer variables are included we also include their square, and
all pairwise interactions of peer variables and prior achievement. † Reports the Kleibergen-Paap rk Wald
statistic for a weak instrument test.

5.3 Robustness

Endogeneity of classroom composition. Given that teachers are randomly assigned to
classrooms and that we focus on $t-1$ practices, a primary remaining endogeneity concern, as discussed
in Section 4, is potential re-sorting of students to classrooms based on the teacher who is randomly
assigned. Balancing tests reported in Section 4.4 already suggest that this is not the case, in that
observable student and peer characteristics are not correlated with the randomly-assigned teacher's
practice. However, given that we observe the students who were initially randomly assigned to the
teacher, we can also test whether estimates of the interaction are systematically different if we
replace actual peers with randomly-assigned peers. These estimates are reported in columns (5) and
(6) of Table 4. Interactions between classroom composition and teaching practice are not
statistically significantly different from their comparable estimates in columns (1) and (2), though smaller
in magnitude. This is consistent with a slight downward bias in columns (5) and (6) generated
from random measurement error in peers.


Contemporaneous teaching practice. One implication of focusing on lagged measures of teaching
practice is that our estimates of the interactions between classroom composition and teaching
practice may understate the true effects. While we prefer focusing on these conservative estimates
because of concerns about the endogeneity of contemporaneous teaching practice, we also explore
how the interactions of teaching practice with classroom composition change when we instrument
for contemporaneous teaching practices with lagged teaching practices. These results are presented
in Appendix Table 10 and discussed in detail in Appendix C.1. We show that interactions between
teaching practice and classroom composition remain robust, but (as expected) are significantly
larger in magnitude.


Measurement error in teaching practice. An additional concern with our findings is to what
extent our results (e.g., lack of significance in the level of the teaching practice measures) are affected
by problems of measurement error in our key teaching practice variables. In order to address this
point, we implement a measurement error correction strategy that follows Hausman et al. (1991).
This approach is more convenient than the usual IV strategy that accounts for error in variables,
because the variables of interest enter non-linearly into our model and we are over-identified by
having more than 2 measures of each practice. In Appendix C.2, we provide a description of how
we adapt the Hausman et al. (1991) method to our context, and describe results obtained after
implementing it. For completeness, we also report results when performing IV corrections (i.e.,
instrumenting one of the measures that corresponds to a given teaching practice with the remaining
measures of that teaching practice). Overall, the findings indicate that our current strategy of taking
averages of the teaching practice variables provides similar results to strategies that correct for
measurement error following these alternative approaches. The level effects and interactions with
initial achievement remain close to 0, but the interactions with classroom composition increase
slightly after correcting for measurement error.
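
The IV correction mentioned above, instrumenting one noisy measure of a practice with another, can be illustrated in a linear setting. This is a minimal sketch and not the Hausman et al. (1991) procedure: it assumes two ratings with classical, mutually independent errors, a single linear effect with no interactions, and hypothetical parameter values.

```python
import random


def cov(x, y):
    """Covariance of two equal-length lists (population normalization)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def simulate_repeated_measures(n=20000, true_effect=0.5, noise_sd=1.0, seed=1):
    """Compare OLS on one noisy rating with IV using a second rating as instrument."""
    rng = random.Random(seed)
    rating_1, rating_2, outcome = [], [], []
    for _ in range(n):
        practice = rng.gauss(0.0, 1.0)                          # true, unobserved practice
        rating_1.append(practice + rng.gauss(0.0, noise_sd))    # first noisy rating
        rating_2.append(practice + rng.gauss(0.0, noise_sd))    # independent second rating
        outcome.append(true_effect * practice + rng.gauss(0.0, 1.0))
    ols = cov(outcome, rating_1) / cov(rating_1, rating_1)      # attenuated toward zero
    iv = cov(outcome, rating_2) / cov(rating_1, rating_2)       # uses only shared signal
    return {"ols": ols, "iv": iv}


if __name__ == "__main__":
    print(simulate_repeated_measures())
```

Because the two rating errors are independent, their covariance isolates the variance of true practice, so the IV ratio is not attenuated in the way the OLS slope is.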


Aggregation. We also consider whether our aggregation of the 8 subdomains into 2 separable
components masks important heterogeneity. Appendix Table 12 shows that similar findings hold
when we consider the 8 different subdomains. The estimates, while still significant, are generally
smaller in magnitude, as would be expected due to increased measurement error.34


6 Mechanisms 


6.1 Teaching Practice vs. Teacher “Quality” 


While the previous section provides compelling evidence that teacher effectiveness varies by
classroom composition, we now explore the extent to which classroom-management and
challenge/student-centered practices may proxy for similar aspects of teacher effectiveness and/or whether more
standard, unidimensional measures of teacher quality are the primary channel through which our teacher
effectiveness measures operate. For instance, teachers who have better classroom management
practices may also engage in more challenge/student-centered practices; therefore not including both
domains in the same specification may bias our estimates. This exploration raises a number of
interesting questions. To be clear, there is no consensus on how teaching quality should be
measured, and FFT was designed to capture different aspects of effective teaching. This means that in
some ways classroom management and challenge/student-centered practices are in fact measures of
quality. Furthermore, the fact that classroom-management and challenge/student-centered
practices interact differently with classroom composition already suggests that a single unidimensional
measure of quality may not be appropriate. Yet, we have other relevant unidimensional scales of quality,
such as the Content Knowledge for Teaching assessment, as well as principal and student surveys, which
we consider here.

34 See Appendix Section C.3 for related discussion.

In order to address these key points, Table 5, Columns (1) and (2) present ITT (ie. Pjt_1 
is replaced with P,4-1) and IV (ie. Pjt-1 is instrumented with P,~-1) results from a model that 
simultaneously controls for classroom management and challenge/student-centered practices and 
their interactions with peer composition. These results show that interactions of classroom man- 
agement with the average peer initial achievement are robust, but seem to explain the interaction 
of challenge/student-centered practices with the average peer initial achievement in the previous 
tables because of strong correlations between these two practices. In contrast, interactions of 
challenge/student-centered practices with the IQR in peer initial achievement remain robust.³⁵ 
Finally, in comparing the results in Tables 4 and 5 we see that key classroom composition interactions 
become stronger when both teaching practices are included in a single specification. This suggests 
that if there is a bias in our interactions from omitted teaching practice/quality, it is leading us to 
understate the true complementarity with classroom composition. 

Columns (3) to (5) of Table 5 report results from ITT specifications similar to column (1) where 
we additionally include different proxies for overall teacher “quality” and their interactions with 
classroom characteristics.³⁶ First, we included teacher performance in the Content Knowledge for 
Teaching (CKT) assessment interacted with classroom characteristics. Second, we included the 
teacher’s lagged average score on student assessments from the TRIPOD survey. TRIPOD assesses 
the extent to which students experience the classroom environment as engaging, demanding, 
and supportive of their intellectual growth.³⁷ Finally, we included school principal evaluations of 
teacher performance, which are reported in the MET database.³⁸ These results show that across 


all specifications our key interactions between teaching practices and classroom composition remain 
significant, and the size of these coefficients is very similar to our previous specifications.³⁹ 

35 Appendix Tables 13, 14, and 15 report all the parameters of these specifications. 

36 Notice that in all previous specifications, we were controlling for a measure of teacher aptitude (i.e., CKT), but it 
was not interacted with classroom characteristics. We cannot control for the usual measures of teacher value-added 
(i.e., adjusted random effects) because these models inherently neglect the presence of classroom-teacher interactions. 

37 Tripod is a protocol that measures teacher effectiveness based on student surveys. See Kane and Staiger (2012) 
for a description of this tool and its importance for predicting teacher value-added. 

38 The fact that our specifications include randomization blocks (which in this case are school-grade fixed effects) 
should account for systematic differences in principals’ reporting. 


Furthermore, we see that these alternative measures of “quality” do not interact with peer average 
initial achievement and IQR in the same way as our two practices. This is true despite CKT and 
principal surveys being statistically significant predictors of math achievement. In contrast to our 
practice measures, these show statistically significant heterogeneity in effects by the student’s ini- 
tial achievement, suggesting that “quality” as measured through CKT and principal assessments 


matters more for better students. 


6.2 Class size 


Because IQR is correlated with class size, an interesting question is whether interactions of 
challenge/student-centered practices are driven by larger class sizes. We test this by adding interactions 
of classroom management and challenge/student-centered practices with class size to column (1) of Table 5. 
We do not report these results, as we find no evidence that either practice interacts with class size. 
Furthermore, positive interactions of classroom management and average peer prior achievement 
and negative interactions of challenge/student-centered practices with the IQR remain robust, and 
if anything increase in magnitude with the additional controls. 


7 Evaluating Teachers 


The development of teacher evaluation protocols has been at the core of the education policy 
debate in recent decades. For example, policymakers in the US have widely implemented 
accountability programs intended to reward or punish teachers based on students’ gains in 
achievement.⁴⁰ More recently, schools have also incorporated classroom observation protocols like FFT to 
further assess teachers. For example, in 2012, the New Jersey Legislature passed the TEACHNJ 
Act, which mandated implementation of a new teacher evaluation system starting in the 2013-2014 
school year and links tenure decisions to evaluation ratings. In response to this mandate, New 


39 Specifications that include teacher experience (as an alternative teacher control) and its interactions with 
classroom and student characteristics provide almost identical results to those reported in columns (3)-(5) of Table 5. 

40 For example, North Carolina implemented the ABCs (Accountability for Basic skills and for local Control) 
program in 1997, and the US federal government developed the NCLB (No Child Left Behind) program in 2002. 
Table 5: Teaching Practices and Alternative Teacher “Quality” Controls 

                                Random      IV: Actual    Random Teacher, Alt. Teacher Control: 
                                Teacher     with Random   CKT         7C          PSVY 
                                            Teacher 
                                (1)         (2)           (3)         (4)         (5) 
Classroom Management            -0.012      -0.016        -0.014      -0.016      -0.015 
                                (0.020)     (0.022)       (0.020)     (0.020)     (0.019) 
C.M. x Math_{t-1}               0.004       0.004         0.011       0.004       0.003 
                                (0.020)     (0.021)       (0.019)     (0.019)     (0.019) 
C.M. x Avg Peer Math_{t-1}      0.076**     0.087**       0.077**     0.076**     0.076*** 
                                (0.029)     (0.036)       (0.030)     (0.029)     (0.027) 
C.M. x IQR Peer Math_{t-1}      0.026       0.035         0.026       0.026       0.026 
                                (0.022)     (0.026)       (0.022)     (0.023)     (0.021) 
Challenge/Student-Centered      0.026       0.025         0.026       0.026       0.011 
                                (0.023)     (0.025)       (0.022)     (0.022)     (0.024) 
C.S.C. x Math_{t-1}             0.010       0.011         0.002       0.016       0.005 
                                (0.020)     (0.021)       (0.020)     (0.019)     (0.019) 
C.S.C. x Avg Peer Math_{t-1}    -0.010      -0.009        -0.010      -0.010      -0.005 
                                (0.019)     (0.022)       (0.019)     (0.019)     (0.019) 
C.S.C. x IQR Peer Math_{t-1}    -0.062***   -0.071***     -0.063***   -0.057**    -0.054** 
                                (0.017)     (0.019)       (0.017)     (0.021)     (0.021) 
Alt. Teacher Control                                      -0.008      -0.006      0.055*** 
                                                          (0.016)     (0.019)     (0.017) 
T.C. x Math_{t-1}                                         0.044***    -0.029**    0.032** 
                                                          (0.014)     (0.013)     (0.013) 
T.C. x Avg Peer Math_{t-1}                                -0.019      -0.007      -0.016 
                                                          (0.018)     (0.020)     (0.016) 
T.C. x IQR Peer Math_{t-1}                                -0.012      -0.017      -0.003 
                                                          (0.021)     (0.021)     (0.016) 
[row label illegible]           0.000       0.000         0.000       0.000       0.002 
[row label illegible]                                     [illegible] [illegible] [illegible] 
First Stage F-Statistic†                    200 

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are 
clustered at the randomization block level. Sample size is 2632. Dependent variable is math and teaching 
practices are measured at t - 1. Regressions use lagged teaching practice of current teacher and include 
randomization block fixed effects and controls for the level and a squared term of prior math achievement 
and average and IQR of peer prior achievement, their square and all pairwise interactions of peer variables 
and prior achievement, as well as student characteristics listed in Table 1. † Reports the Kleibergen-Paap 
rk Wald statistic for a weak instrument test. CKT denotes Content Knowledge for Teaching assessment, 7C 
denotes overall student survey teacher ratings based on Tripod and PSVY denotes principal assessments 
of teacher quality. TC denotes alternative teacher control (i.e., CKT, 7C, or PSVY). See Appendix Tables 
13, 14, and 15 for all parameters. 


Jersey has developed the program AchieveNJ that relies on classroom observation protocols such 
as FFT to evaluate teachers.⁴¹ 

Due to the increasing availability of data to evaluate teachers (e.g. value-added measures), 
many scholars have highlighted the benefits of replacing the least effective 5% to 7% of teachers 
with average teachers (Chetty et al., 2014; Hanushek, 2011). However, these exercises in evaluating 
teachers and determining rankings rely crucially on the assumption that teacher effectiveness can 
be isolated from classroom characteristics. Our findings of a statistically significant complemen- 
tarity between teaching practices and classroom composition suggest that the estimated teacher 
contributions will depend on classroom composition. To further quantify the relevance of these 
complementarities within the context of our empirical strategy, we implemented different simula- 
tion exercises that aim to determine how rankings of teacher contributions to learning vary when 
teachers are re-allocated into different classrooms. 

We focus on our estimates in column (5) of Table 5 to create, based on our measures of teaching 
practices, a measure for teacher contribution to learning. We choose these estimates because they 
include a role for overall teacher quality through principal surveys, which is shown to have a direct 
effect on math achievement and so offers a useful anchoring of our two FFT practices, which do 
not have a statistically significant direct effect. We define the teacher contribution as 

$$TC_j = \frac{1}{N_j}\sum_{i=1}^{N_j}\left(\hat{\alpha}_{p} P_j + \hat{\alpha}_{py} P_j Y_{it-1} + \hat{\alpha}_{pg} P_j \bar{Y}_{-ic,t-1} + \hat{\alpha}_{pq} P_j IQR_{c,t-1}\right),$$ 

where $IQR_{c,t-1}$ denotes the interquartile range of the class’s prior achievement, $N_j$ denotes the 
class size for teacher $j$, $\hat{\alpha}_{p\cdot}$ the related coefficient estimates, and $P_j$ is a vector of classroom 
management practices, challenge/student-centered practices (using the average prior measures as discussed 
above) and the principal’s evaluation of teacher quality. We explore two different thought 
experiments to understand the magnitude of the effects. 
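
As a rough illustration of how the teacher contribution can be evaluated, the sketch below plugs the column (5) point estimates from Table 5 into the definition above. The function name, the data layout, and the homogeneous hypothetical classrooms are our own illustrative assumptions, not the paper's code; the 0.42 shift in own prior achievement follows the accompanying footnote on how a one standard deviation increase in average peer achievement moves prior student achievement.

```python
# Coefficients from column (5) of Table 5. Each tuple holds
# (level, x own prior math, x avg peer prior math, x IQR peer prior math).
COEF = {
    "classroom_management": (-0.015, 0.003, 0.076, 0.026),
    "challenge_student_centered": (0.011, 0.005, -0.005, -0.054),
    "principal_survey": (0.055, 0.032, -0.016, -0.003),
}

def teacher_contribution(practices, students, iqr):
    """Average over students of the practice terms in the TC definition.

    practices: dict name -> standardized score of the teacher (hypothetical input)
    students:  list of (own_prior, avg_peer_prior) pairs, one per student
    iqr:       standardized IQR of prior achievement in the class
    """
    total = 0.0
    for own_prior, peer_prior in students:
        for name, p in practices.items():
            a_p, a_py, a_pg, a_pq = COEF[name]
            total += p * (a_p + a_py * own_prior + a_pg * peer_prior + a_pq * iqr)
    return total / len(students)

# A teacher one standard deviation above average on all three measures.
teacher = {name: 1.0 for name in COEF}

# Hypothetical classes of 20 identical students, used only for illustration.
baseline = teacher_contribution(teacher, [(0.0, 0.0)] * 20, iqr=0.0)
higher_mean = teacher_contribution(teacher, [(0.42, 1.0)] * 20, iqr=0.0)
higher_iqr = teacher_contribution(teacher, [(0.0, 0.0)] * 20, iqr=1.0)

print(round(higher_mean - baseline, 3))  # 0.072
print(round(higher_iqr - baseline, 3))   # -0.031
```

Under these assumptions the two differences are close to the 0.07 and -0.03 changes in teacher contribution reported for the first thought experiment in this section.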

First, consider a teacher whose classroom management, challenge/student-centered and principal 
evaluation scores are all one standard deviation above average. Giving this teacher a classroom whose 
average peer prior achievement is one standard deviation higher (relative to the mean), 
holding IQR fixed, would increase her teaching contribution to student learning (TC) by 0.07 
and increase her rank in the teacher contribution distribution by 0.20 percentile points.⁴² In a 
similar vein, holding the average peer initial achievement fixed and increasing IQR to one standard 
deviation above the mean would decrease her teacher contribution by 0.03 and decrease her rank by 0.14 percentiles. 

41 Teacher practice accounts for 55% of the teacher evaluations. The following link provides the list of approved 
teacher practice evaluation instruments used in New Jersey: 
https://www.nj.gov/education/AchieveNJ/teacher/approvedlist.pdf 

Second, suppose we randomly reallocate teachers to one of the other classrooms within a 
randomization block, potentially similar to the usual allocation problem that principals face every year 
for a given math course/level, which is how the randomization blocks were defined. In this case, 
TC changes in absolute value by 0.11 of a standard deviation and the teacher’s rank changes on 
average by 0.32 percentiles. As a result, within the randomization block, 80% of teachers change 
rank. These counterfactual changes in rank are not trivial from a policy perspective. Our 
simulation shows that more than 3/4 of the teachers who were ranked in the bottom 5 to 20% of the 
teaching contribution would no longer belong to that group after such a re-allocation of classrooms. 
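
The reallocation exercise can be sketched as follows. This is a minimal illustration under our own assumptions, not the simulation code behind the reported numbers: teachers and classrooms are drawn from hypothetical standard normal distributions, each classroom is summarized by three class-level values, and teachers are moved by a random cyclic rotation within each block so that every teacher receives a different classroom. Only the coefficients come from column (5) of Table 5.

```python
import random

# Column (5) of Table 5: (level, x own prior, x avg peer prior, x IQR)
COEF = [(-0.015, 0.003, 0.076, 0.026),   # classroom management
        (0.011, 0.005, -0.005, -0.054),  # challenge/student-centered
        (0.055, 0.032, -0.016, -0.003)]  # principal survey

def contribution(teacher, room):
    """TC of a teacher (three standardized scores) in a classroom
    summarized by (mean own prior, avg peer prior, IQR)."""
    own, peer, iqr = room
    return sum(p * (a + ay * own + ag * peer + aq * iqr)
               for p, (a, ay, ag, aq) in zip(teacher, COEF))

def percentile_ranks(values):
    """Rank each value on a 0-1 scale (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for pos, i in enumerate(order):
        ranks[i] = pos / (len(values) - 1)
    return ranks

def reshuffle_within_blocks(blocks, rng):
    """Map each teacher to another teacher's classroom in the same block.
    Assumes every block has at least two teachers."""
    new_room = {}
    for members in blocks.values():
        order = members[:]
        rng.shuffle(order)
        for k, t in enumerate(order):
            new_room[t] = order[(k + 1) % len(order)]  # random cyclic rotation
    return new_room

# Hypothetical data: 50 blocks with 3 teacher-classroom pairs each.
rng = random.Random(0)
teachers, rooms, blocks = [], [], {}
for b in range(50):
    for _ in range(3):
        idx = len(teachers)
        teachers.append([rng.gauss(0, 1) for _ in COEF])
        peer = rng.gauss(0, 1)
        rooms.append((0.42 * peer, peer, rng.gauss(0, 1)))
        blocks.setdefault(b, []).append(idx)

before = [contribution(teachers[i], rooms[i]) for i in range(len(teachers))]
assign = reshuffle_within_blocks(blocks, rng)
after = [contribution(teachers[i], rooms[assign[i]]) for i in range(len(teachers))]

r0, r1 = percentile_ranks(before), percentile_ranks(after)
mean_abs_rank_change = sum(abs(a - b) for a, b in zip(r0, r1)) / len(r0)
print(f"mean absolute change in percentile rank: {mean_abs_rank_change:.2f}")
```

Because classrooms within an actual randomization block are likely more similar than these independent draws, the magnitudes produced by the sketch should not be compared with the reported 0.11 and 0.32.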
In summary, simulations suggest that complementarities between teaching practices and 
classroom composition play a key role in determining teacher contributions to learning. In this regard, 
our findings suggest caution on how to implement policies that aim to replace the bottom 5% 
of teachers, given that teachers’ relative ranks are likely to depend on the characteristics of the 
classroom they are facing. 


8 Conclusion 


In this paper, we illustrate that the effects of teaching practice vary significantly with classroom 
composition. Our preferred estimates indicate that classroom management practices increase math 
achievement by 0.09 of a standard deviation when average classroom initial peer math performance 
is 1 standard deviation above average. In contrast, challenge/student-centered practices decrease 
math performance by 0.07 of a standard deviation when the classroom IQR in initial achievement 
is 1 standard deviation above average. We view these estimates as sizable given that some of the 
larger estimates of a standard deviation increase in teacher value-added, which is based on 
unobservable teacher contributions to math, range from 0.11 to 0.16 (Chetty et al., 2014). Simulations 
also illustrate that reassigning classrooms to teachers within a randomization block would change 
their teacher contribution by 0.11 standard deviations and change their rank on average by 0.32 
percentiles. 

42 Note that a 1 standard deviation increase in average peer achievement (which is in standard deviation units) 
also increases prior student achievement by 0.42. 

We make three key contributions to the literature on teacher effectiveness. First, we illustrate 
that failure to account for moderating effects of classroom composition may lead researchers to 
severely misstate the importance of a given measured teaching practice for achievement. This helps 
address the common mystery of why teacher effectiveness is so hard to measure and may even 
help reconcile mixed findings in different contexts. Second, failure to account for the moderating 
effects of classroom composition also leads us to overstate the importance of individual student- 
level heterogeneity in the effects of teaching practice. Indeed, in our context, it appears that all 
heterogeneity is driven by classroom composition. 

Third, the focus in the literature on a single, unidimensional measure of teacher effectiveness 
may be misguided. Our two measures of teacher effectiveness interact with different aspects of 
classroom composition. Furthermore, we show that our estimated interactions of teaching practice 
with classroom composition remain after controlling for additional standard measures of teacher 
“quality,” such as Content Knowledge for Teaching Assessment, student evaluations and principal 
surveys. In contrast to our FFT-based teaching practice measures, these additional measures of 
teacher quality show some evidence of heterogeneity by student prior achievement, but do not 
interact with classroom composition, providing supportive evidence. 

Our findings also have important implications for the peer effects literature. Because the effects 
of peers vary significantly with teaching practice, this suggests that failure to account for these in- 
teractions may also severely understate the importance of peers in different contexts. Furthermore, 
it suggests the potential for a change in policy emphasis from reallocating students to classrooms to 
meet different achievement objectives (which can be costly and involve severe tradeoffs among dif- 
ferent types of students) to determining teaching practices that best fit different classroom contexts 


or better matching of teachers to classrooms. 

Finally, our results also have important implications for policies related to (1) teacher evaluation 
and accountability and (2) teacher professional development and training. Classroom observations 
of teaching practice (scored using the FFT and other protocols) are now routinely used in annual 
teacher evaluation and accountability. Our findings suggest that, depending on teachers’ 
assignments or the overall school context, specific domains of instructional practice may be more relevant 
to teacher effectiveness than others. As such, specific domains of instruction (rather than an 
overall observational score) may be emphasized in accountability systems depending on teaching 
assignments and/or school context. Moreover, the presence of complementarities between 
teaching practices and classroom composition suggests that further caution should be exercised when 
teachers are evaluated based on these protocols. In particular, we show that rankings of teacher 
contributions to learning are highly sensitive to even relatively small within-school classroom 
reassignments. Therefore, policies that aim to replace the bottom 5% to 10% of teachers could be 
sensitive to the characteristics of the classrooms that teachers face. 

In terms of teacher professional development and training, our findings reinforce the importance 
of explicit attention to challenges stemming from classroom-achievement heterogeneity (Cohen and 
Lotan, 1997; Seaton et al., 2010). We further find that scores on protocol subdomains do not appear 
to be as orthogonal in practice as they are in principle, or are intended to be.⁴³ Further research 
could benefit from determining how to more fully differentiate, to the extent it is feasible, different 
aspects of teaching practice to make more formative recommendations for teacher training and 
development. That said, our research provides compelling evidence that any such recommendations 


should be adapted to the challenges faced by different school and classroom contexts. 


43 That is, the MET observational protocols seem to have been developed as formative measures of instruction, where 
ideally the protocol would be useful in assessing “weak points” to target for instructional improvement. This is our 
own interpretation of these protocols. The supporting documentation we examined for the FFT protocol, for example, 
does not specifically address the extent to which it was designed to measure a formative construct (Danielson, 2011, 
2012). 

A Randomization and Sample Selection 


Randomization: When schools joined the MET study in 2009-2010, principals were asked to 
identify groups of teachers that 1) were teaching the same subject to students in the same grade, 
2) were certified to teach common classes, and 3) were expected to teach the same subject to 
students in the same grade the following year. These groups of teachers were called “exchange 
groups.” The plan was for principals to create class rosters as similar as possible within an exchange 
group, and then send these rosters to MET to be randomly assigned to “exchangeable” teachers. 
One issue in practice was that when it came time to perform the randomization, not all teachers 
within an exchange group were able to teach during a common period. As a result, randomization 
was performed within subsets of exchange groups called “randomization blocks.” In summary, 
MET requested scheduling information for 2,462 teachers from 865 exchange groups in 316 schools. 
From this, they created 668 randomization blocks from 619 exchange groups in 284 participating 
schools. The drop off in teachers can be attributed to either a school refusing to permit randomly 
swapping rosters, or all remaining MET project teachers leaving the school or the study prior to 
randomization. From these randomization blocks, 1,591 teachers were randomly assigned to class 
rosters.⁴⁴ Teachers were lost either because they were not scheduled to teach the exchange group 
subject and grade level in 2010-2011 or because they decided not to participate (Kane et al., 2013). 
Since assignments were made based on preliminary rosters at the end of the previous school 
year, before school administrators knew who would be attending their school, there was both 
attrition from the sample and additional students who moved into the school and needed to be 
incorporated in the sample. As a result, our analysis does not rely on the assumption that the 
observed classroom composition is random, but rather exploits what we know to be random—the 
initial random assignment of teachers to classrooms. We discuss this further in Section 4. We cannot 
include students who were not in the randomization sample in our main analysis, which relies on 
the randomization, but we do include them as part of the calculation of classroom composition 
when prior test scores are available. For the average student in our final sample, 78% of classroom 
peers were included in randomization, and we observe prior test scores for 91% of classroom peers. 


Sample Selection: Our sample selection is motivated by our estimation strategy. We start with 
the entire sample of elementary students observed in the randomization year (2010-11), in either a 
math or joint math and ELA classroom, which includes 11,409 student observations. Since we rely 
on the random assignment of teachers to classrooms, we restrict the sample to the 5,730 students 
who were randomly assigned a teacher (but did not necessarily comply). The characteristics of 
these students are summarized in Appendix Table 8. Note that while six districts participated, 
only five were asked to have elementary schools participate. 

Further sample restrictions are necessary for our estimation strategy. We require observed test 
scores prior to and after the randomization year. We also require non-missing teaching practices 
from the first year (2009-10) and Content Knowledge for Teaching (CKT) scores in math. We 
restrict the sample to students whose actual and randomly assigned teachers have non-missing values 
for both, which reduces the sample to 4,201. 

Using the remaining students, we count the number of students per class and restrict the sample 
to all classes with a minimum of 7 students. From this restriction we are left with 4,124 students. 
While the true class sizes are much larger than this, we do this to avoid the possibility of results 
being driven by unusually small classes based on our previous sample restrictions. 

44 The number of randomized teachers includes 386 high school teachers and 24 teachers from grades 4-8 for whom 
rosters were later found to be invalid by MET. We do not include these in our sample. 

Finally, our estimation strategy requires a minimum of two teachers per randomization block, 
and we also want to ensure randomization was performed properly. There are 3,618 students in 
a randomization block with at least two teachers. Of these remaining students, 2,682 are in a 
randomization block with at least a 50% compliance rate. 

At this point, we find there are 44 duplicate student observations between classes, which we 
drop. We then re-run the class size, teachers per randomization block, and randomization block 
compliance rate restrictions. 

The final restricted regression sample has 2,632 student observations. These student observations 
span 5 districts, 39 schools, 57 randomization blocks, 147 teachers, and 147 classrooms, with 87% 
of student observations coming from joint math/ELA courses. Table 1 presents summary statistics 
of our final regression sample. 


B Alternative Models 


While the behavioral model in Section 3 posits some possible channels of complementarities, alterna- 
tive plausible models of student behavior would produce similar complementarities. For instance, it 
is straightforward to add to the model that students conform to the average behavior of classmates, 
so that utility is 


$$U_{it} = \gamma_y Y_{it} - \rho\,(b_{it} - \gamma \bar{b}_{-it})^2 + \gamma_{bp} P_j b_{it}.$$ 


This captures the conformity-type peer effects that are the focus of the social interactions literature 
(Brock and Durlauf, 2001; Epple and Romano, 2010). In this case, optimal behavior would be a 
function of peer behavior and teaching practice and similar results would follow, except here the 
benefits of the teaching practice are amplified through the re-enforcing behavior of peers. For 
instance, a teacher’s classroom management practice encourages a student and her peers to behave 
better, and the better behavior of peers further encourages the student’s own better behavior 
and vice-versa. The interaction between teaching practice and peer initial achievement would 
follow again in this model because the marginal product of good behavior differs with peer initial 
achievement. 

Furthermore, we could also motivate the interaction between teachers and peers as arising 
through a production function that has complementarities between average peer behavior and own 
behavior, i.e., 


$$\begin{aligned}
Y_{it} = {} & \beta_0 + \beta_b b_{it} + \beta_{by} b_{it} Y_{it-1} + \beta_{bg} b_{it}\, m(Y_{-ic,t-1}) + \beta_{b\bar{b}} b_{it} \bar{b}_{-it} + \beta_{\bar{b}} \bar{b}_{-it} \\
& + \beta_y Y_{it-1} + \beta_g m(Y_{-ic,t-1}) + \beta_p P_j + \beta_{py} P_j Y_{it-1} + \beta_{pg} P_j\, m(Y_{-ic,t-1}) + \varepsilon_{it},
\end{aligned}$$ 

where there are direct spillovers from peer behavior and the achievement benefits of behavior are 
increasing in peer behavior. This channel connects well with Lazear (2001)’s classic treatment of 
the classroom learning environment as a public good that is disrupted by student behaviors. The 
reduced form in this setting would be similar in structure to the above when $m(Y_{-ic,t-1}) = \bar{Y}_{-ic,t-1}$, 
with the addition of the $P_j^2$ term arising through the interaction of own and peer behavior, both 
of which are increasing in $P_j$. 
C Robustness 


C.1 Robustness check on contemporaneous teaching practice 


Table 10 shows robustness checks where we estimate equation (4) focusing on our key interactions: 
challenge/student-centered practices interacted with the IQR in initial peer achievement and 
classroom management interacted with average initial peer achievement. A challenge we face is that it 
is difficult to instrument for all the entries of teaching practice and its interactions without 
running into a weak instrument problem. As a result, we build the argument sequentially to show 
that weak instruments are not driving estimates of our key interactions. Panel A shows results 
for classroom management and panel B for challenge/student-centered practices. The first column 
shows results for the ITT when $P_{rt-1}$, $P_{rt-1}Y_{it-1}$, $P_{rt-1}\bar{Y}_{-ic,t-1}$, $P_{rt-1}IQR_{c,t-1}$ are all included in 
the regression. Then, column (2) shows that estimates of key interactions are robust when all other 
teaching practice terms are dropped except our main interactions of interest, i.e., $P_{rt-1}IQR_{c,t-1}$ 
for challenge/student-centered and $P_{rt-1}\bar{Y}_{-ic,t-1}$ for classroom management. Column (3) then 
instruments for $P_{rt}\bar{Y}_{-ic,t-1}$ with $P_{rt-1}\bar{Y}_{-ic,t-1}$ for classroom management and $P_{rt}IQR_{c,t-1}$ with 
$P_{rt-1}IQR_{c,t-1}$ for challenge/student-centered. The F-statistics for weak instrument tests are 
28 and 27, respectively, indicating that there is not a weak instrument problem. 
And, in both cases the estimated interactions are significantly larger, increasing from 0.08 to 0.22 
for the case of classroom management with the average and from -0.06 to -0.18 for student-centered 
practices with IQR. 

Column (4) shows another variation of this when we continue to control for $P_{rt-1}$, $P_{rt-1}Y_{it-1}$, 
but only drop from the regression the irrelevant peer interactions, i.e., the interactions with IQR 
for classroom management and average initial peer achievement for challenge/student-centered. 
Column (5) controls for contemporaneous teaching practice in levels and interacted with prior 
achievement ($P_{rt}$, $P_{rt}Y_{it-1}$) and only instruments for key interactions of contemporaneous 
teaching practice with peer variables ($P_{rt}\bar{Y}_{-ic,t-1}$ with $P_{rt-1}\bar{Y}_{-ic,t-1}$ for classroom management and 
$P_{rt}IQR_{c,t-1}$ with $P_{rt-1}IQR_{c,t-1}$ for challenge/student-centered). Again, F-statistics for the weak 
instrument test are in all cases above 20 and the key variables of interest remain very similar to 
estimates in column (3) that do not control for level effects or interactions with initial 
achievement. Finally, column (6) instruments for all entries of contemporaneous teaching practice (i.e., 
$P_{rt}$, $P_{rt}Y_{it-1}$ are also instrumented with $P_{rt-1}$ and $P_{rt-1}Y_{it-1}$), along with the key classroom 
composition interactions as in column (5). In this case, F-statistics on tests for weak instruments drop 
below 10, but we see that the estimated interactions with classroom composition remain remarkably 
stable, suggesting that estimates are not driven by weak instruments. 
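
The logic of instrumenting a single interaction term can be illustrated with a stylized simulation. The sketch below is not the specification estimated in Table 10: it has no fixed effects, controls, or clustering, and it assumes (our assumption) that the contemporaneous and lagged practice measures are the true practice plus independent classical measurement error. All variable names and parameter values are hypothetical.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

def iv_slope(y, x, z):
    """Just-identified IV estimator with one instrument: Cov(z, y) / Cov(z, x)."""
    return cov(z, y) / cov(z, x)

def first_stage_f(x, z):
    """Homoskedastic F-statistic from regressing x on a single instrument z."""
    n = len(x)
    pi = cov(z, x) / cov(z, z)
    mx, mz = mean(x), mean(z)
    rss = sum(((xi - mx) - pi * (zi - mz)) ** 2 for xi, zi in zip(x, z))
    var_pi = (rss / (n - 2)) / (cov(z, z) * (n - 1))
    return pi * pi / var_pi

rng = random.Random(1)
beta = 0.2                          # hypothetical true interaction effect
y, x_int, z_int = [], [], []
for _ in range(5000):
    p_true = rng.gauss(0, 1)        # unobserved practice
    peer = rng.gauss(0, 1)          # average peer prior achievement
    p_now = p_true + rng.gauss(0, 1)   # noisy contemporaneous measure
    p_lag = p_true + rng.gauss(0, 1)   # noisy lagged measure (the instrument)
    y.append(beta * p_true * peer + rng.gauss(0, 1))
    x_int.append(p_now * peer)
    z_int.append(p_lag * peer)

ols = cov(x_int, y) / cov(x_int, x_int)   # attenuated by measurement error
iv = iv_slope(y, x_int, z_int)            # instrumented interaction
print(f"OLS: {ols:.3f}  IV: {iv:.3f}  first-stage F: {first_stage_f(x_int, z_int):.0f}")
```

In this setup the uninstrumented coefficient is attenuated toward zero while the instrumented one is centered on the true value, which is one mechanism that could make instrumented interactions larger in magnitude.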


C.2 Nonlinear Measurement Error 


To show how Hausman et al. (1991) can be adapted to our setting to deal with measurement error 
in teaching practice, we consider a simplified version of our main estimating equation (4). Let $\tilde{Y}$ 
denote $Y$ demeaned at the randomization block level and similarly for other variables, then 

$$\tilde{Y}_{it} = \alpha_p P_r + \alpha_{pg} P_r \tilde{\bar{Y}}_{-ic,t-1} + \alpha_g \tilde{\bar{Y}}_{-ic,t-1} + \alpha_y \tilde{Y}_{it-1} + \alpha_{py} P_r \tilde{Y}_{it-1} + \varepsilon_{it}. \tag{6}$$ 


Recall that $P_r$ is the true practice, but it is measured with error. We adapt Hausman et al. (1991) 
in two ways. First, we relax the assumptions on the measurement model because we have more than 
2 measures for each practice. Second, we adapt their approach, which was made for nonlinearities 
captured by polynomials in the variable of interest, to our setting, where nonlinearities arise from 
interactions. 
The parameters of equation (6) are identified from 


E(Vjt) = opE(P,) + opgE (Pr¥ ~icyt—1) + ag E(Y—icgt—1) + yE(Vit-1) + OpyE(PrY it-1) 
(7) 
E(YitP,) = apE(P,P,) + Oipg E(Pr¥ —ic,t-1P +) ah agE(Y_iot-1P,) + dyE(Vit1P,) 
a Opy E(P,Y 4-1P,) 
BVGY Ge¢-4) = Op E(PpY_icgt-1) a i CL) queen care ay + OEY que Vp) 5) + Oy E(Yie_1¥_icgt1) 
5 Otpy E(PrY it_1Y—icyt-1) 
E(YitPeY —iegt1) = OpE(BeP-Y iegt1) + OpgE( PY —tegt_1Pr¥ sexe 1) + Og E(Yseut1P-¥ —icgt1) 
a oy E(Vis-1P-Y —iost—1) at Big PLY as Ben) 
E(YitYit-1) = ay E(P,Yit-1) + Reg CPY py AVe4) GHEY 259V 4-4) oF ay E (Yit-1Yit—1) 
+ OpyE(PrY i¢1Yit—-1) 
E(YitP-Y it—-1) = ap E(P»P.Y i-1) + app BPAY piped PV 4) e ag E(Y—iest1PrY its) a Ory E(Yit_1PY i¢_1) 
+ Opy E(PrY x1 Pr t—1) 


We need to recover all of the moments containing $P_r$. The issue is that $P_r$ is not observed, so 
next we discuss how to use our measures of practice to recover these moments. 

We assume that we have at least 3 demeaned measures of practice following equation 5, such 
that 

$$P_{jkt} = \delta_k P_j + u_{jkt},$$ 

where $k \in \{1,\dots,K\}$ and $K \geq 3$. We focus the measurement equation around the mean reports 
for each subdomain, calculated over multiple videos and video raters, though we could apply 
adjustments to the individual level observations as well. Then, applying a normalization, $\delta_1 = 1$, we 
have 

$$\frac{Cov(P_{jnt}, P_{jmt})}{Cov(P_{j1t}, P_{jmt})} = \frac{\delta_n \delta_m V(P_j)}{\delta_m V(P_j)} = \delta_n,$$ 

for $n, m \neq 1$ and $n \neq m$, thus permitting us to recover the parameters $\delta_2, \dots, \delta_K$. Notice further that 

$$E(P_{j1t} P_{jnt}) = \delta_n E(P_j^2), \quad \text{for } n \neq 1,$$ 

and $E(P_j^2)$ is thus identified and similarly, 
E( Pig Pag) = on hh); for ne 1, 


given that measurement error is also uncorrelated across measures after removing randomization 
block fixed effects. Note that $E(P_j) = 0$. 
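
The covariance-ratio argument can be checked numerically. The sketch below simulates three measures of a latent practice with hypothetical loadings and independent measurement errors (the loadings, error variance, and sample size are illustrative assumptions), then recovers the loadings and the variance of the latent practice using only covariances between observed measures.

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)

rng = random.Random(0)
true_delta = [1.0, 0.8, 1.3]   # hypothetical loadings; delta_1 = 1 is the normalization
n = 20000

# Latent practice with mean zero, and three noisy measures of it.
latent = [rng.gauss(0, 1) for _ in range(n)]
measures = [[d * p + rng.gauss(0, 0.7) for p in latent] for d in true_delta]
m1, m2, m3 = measures

# delta_n = Cov(P_n, P_m) / Cov(P_1, P_m) for n, m != 1 and n != m.
delta2_hat = cov(m2, m3) / cov(m1, m3)
delta3_hat = cov(m3, m2) / cov(m1, m2)

# E(P_1 P_n) = delta_n E(P^2), so the second moment of the latent practice follows.
var_p_hat = cov(m1, m2) / delta2_hat

print(round(delta2_hat, 2), round(delta3_hat, 2), round(var_p_hat, 2))
```

The measurement error variances never need to be estimated: each ratio uses only cross-measure covariances, in which independent errors drop out.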


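We can use our anchor measure then to recover 

$$\begin{aligned}
E(P_{r1t} \tilde{\bar{Y}}_{-ic,t-1}) &= E(P_r \tilde{\bar{Y}}_{-ic,t-1}) \\
E(P_{r1t} \tilde{Y}_{it-1}) &= E(P_r \tilde{Y}_{it-1}) \\
E(\tilde{Y}_{it} P_{r1t}) &= E(\tilde{Y}_{it} P_r) \\
E(\tilde{Y}_{it} P_{r1t} \tilde{Y}_{it-1}) &= E(\tilde{Y}_{it} P_r \tilde{Y}_{it-1}) \\
E(\tilde{Y}_{it} P_{r1t} \tilde{\bar{Y}}_{-ic,t-1}) &= E(\tilde{Y}_{it} P_r \tilde{\bar{Y}}_{-ic,t-1})
\end{aligned}$$ 

But to recover terms which have higher order products of $P_r$, such as $E(P_r \tilde{Y}_{it-1} P_r)$, we rely on 
the ratio of covariances to first recover $\delta_2$. We can then use our anchor measure and measurement 
two to recover 

$$E\left(\frac{P_{r1t} \tilde{Y}_{it-1} P_{r2t}}{\delta_2}\right) = E(P_r \tilde{Y}_{it-1} P_r).$$ 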

Specifically, in estimation we pick an anchor measurement, $P_{j1t}$, and use it to construct the terms in equation (6). To construct rows two, four and six in the system (7), we multiply equation (6) by

$$\frac{P_{j2t}}{\delta_2}, \quad \frac{P_{j2t} \bar{Y}_{-ict-1}}{\delta_2} \quad \text{and} \quad \frac{P_{j2t} Y_{ict-1}}{\delta_2}$$

and then take expectations. Note that we use measurement two when multiplying through and then divide by the measurement parameter we've recovered.

Estimation of the parameters from these moments is then straightforward. We recover the relevant moments from the measurement model, plug them into the system defined in (7), and solve this system for the structural parameters. We can bootstrap standard errors, clustering at the randomization block level. Note that because we are overidentified, we can also test the robustness to using different measures as our anchor.

Appendix Table 11 shows results when we correct for measurement error by following two strategies. First, we present findings when we implement the Hausman et al. (1991) method described above. Second, for completeness, we report specifications in which we instrument a given measure of a teaching practice at t - 1 (e.g., creating an environment of respect and rapport when considering the broad category classroom management) with the remaining teaching practices at t - 1 (e.g., managing student behaviors and classroom procedures). Overall, the results indicate that taking averages across the measurements that correspond to a specific broad teaching practice (i.e., classroom management or challenge/student-centered) leads to results similar to those obtained when we correct for measurement error using these other methods.
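The logic of instrumenting one noisy measure of a practice with another can be illustrated with a small simulation. This is a sketch under simplifying assumptions (a single regressor, no interactions, no fixed effects, independent classical measurement errors); the coefficient, noise levels and variable names are hypothetical and are not taken from the paper.

```python
import random

def cov(x, y):
    """Sample covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def ols_slope(y, x):
    """Slope from a simple regression of y on x."""
    return cov(x, y) / cov(x, x)

def iv_slope(y, x, z):
    """Just-identified IV slope of y on x, using z as the instrument."""
    return cov(z, y) / cov(z, x)

rng = random.Random(1)
n = 20000
beta = 0.5  # hypothetical true effect of the latent practice

latent = [rng.gauss(0.0, 1.0) for _ in range(n)]
# Two measures of the same latent practice with independent errors.
measure_a = [p + rng.gauss(0.0, 0.8) for p in latent]
measure_b = [0.9 * p + rng.gauss(0.0, 0.8) for p in latent]
outcome = [beta * p + rng.gauss(0.0, 1.0) for p in latent]

# OLS on the noisy measure is attenuated by roughly 1 / (1 + 0.8**2).
b_ols = ols_slope(outcome, measure_a)
# The second measure is correlated with the first only through the latent
# practice, so using it as an instrument removes the attenuation.
b_iv = iv_slope(outcome, measure_a, measure_b)
print(round(b_ols, 2), round(b_iv, 2))
```

The same reasoning suggests why averaging several measures of a broad practice can behave similarly to these corrections: averaging shrinks the measurement-error variance relative to the signal.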


C.3 Choosing Practices that Matter 


A tension in using our composite measures of teaching practice is that they do not provide the fine-grained prescriptive evidence one would want on which practices matter most in different settings, evidence that is arguably in keeping with the formative underpinnings of the FFT and its eight separate subdomains. With this in mind, we present in Table 12 results at the subdomain level to complement the evidence from the aggregated subdomains, mirroring the specification in column (2) of Table 4 for each subdomain separately. We offer two notes of caution when interpreting these results. First, a higher degree of measurement error in the individual subdomain measures should bias interactions toward zero. Second, the subdomains are highly correlated, as revealed by the exploratory factor model.

A main pattern in this table is a positive interaction between average peer prior achievement and each of the subdomains that make up classroom management (the first 3 columns of Table 12, Panel A), i.e., creating an environment of respect and rapport (CERR), managing classroom procedures (MCP) and managing student behaviors (MSB).* The interaction for managing student behaviors (MSB) is the largest, but it is not statistically significantly different from those of the other subdomains. Teachers with high levels of MSB are characterized by establishing clear expectations for student conduct and by implementing them efficiently. This suggests that peer effects are amplified by teachers who can preempt misbehavior in the classroom. The two other subdomains, MCP and CERR, are linked to teachers' skills in managing more general aspects of the classroom environment, including instructional groups, transitions and teacher/student interactions. The significant positive interactions with average prior achievement suggest that a number of interrelated practices, beyond just limiting disruptive behaviors, create an environment where students can benefit more from having higher-achieving peers.

Second, across the board, the five subdomains that make up challenge/student-centered practice exhibit negative interactions with class IQR. These include establishing a culture of learning (ECL), engaging students in learning (ESL), using questioning and discussion techniques (USDT), using assessment in instruction (UAI) and communicating with students (CS). Among these, communicating with students has the largest negative coefficient but also the highest standard error. The definitions of these rubrics are closely related to promoting active student participation in the class as a key element of the learning process. More detailed consideration of the rubrics also reveals significant emphasis on challenging students in the different subdomains. Our findings indicate that the benefits of these practices depend largely on the heterogeneity in classroom prior achievement. In particular, promoting discussion among students may not be a good learning tool when students do not share a broadly similar level of understanding of key concepts. For example, large heterogeneity in classroom achievement is likely to require different levels of complexity in the discussion, making the learning process more complicated. Likewise, it may be difficult to challenge all students when there is a great deal of heterogeneity in background.
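To make the two classroom composition measures concrete, the sketch below computes, for each student, the mean and interquartile range of classmates' prior scores. It assumes a leave-one-out definition of peers and uses the default quantile method of Python's `statistics` module; the function names and the two example classrooms are hypothetical and not taken from the paper's data.

```python
import statistics

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def peer_composition(prior_scores):
    """For each student, return (mean, IQR) of classmates' prior scores,
    leaving the student's own score out."""
    result = []
    for i in range(len(prior_scores)):
        peers = prior_scores[:i] + prior_scores[i + 1:]
        result.append((statistics.mean(peers), iqr(peers)))
    return result

# Two hypothetical classrooms with the same average prior achievement
# but very different dispersion.
homogeneous = [-0.2, -0.1, 0.0, 0.1, 0.2]
heterogeneous = [-1.5, -0.5, 0.0, 0.5, 1.5]

hom = peer_composition(homogeneous)
het = peer_composition(heterogeneous)
for (mean_a, iqr_a), (mean_b, iqr_b) in zip(hom, het):
    print(round(mean_a, 3), round(iqr_a, 3), "|", round(mean_b, 3), round(iqr_b, 3))
```

A student in the second classroom faces peers with a similar average but a much wider spread, which is the dimension along which the challenge/student-centered interactions are estimated to be negative.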


* See Table 6 in the appendix for the definition of each subdomain.
D Appendix Tables


Table 6: Description of Framework for Teaching (FFT) 


Classroom Management Practices

Managing student behaviors (MSB)                        Monitoring of student behavior, response to student misbehavior, expectations
Managing classroom procedures (MCP)                     Management of instructional groups, transitions, and materials and supplies
Creating an environment of respect and rapport (CERR)   Teacher interactions with students and student interactions with each other

Challenge/Student-Centered Practices

Establishing a culture of learning (ECL)                Importance of content and expectations for learning and achievement
Communicating with students (CS)                        Expectations for learning, directions and procedures, explanations of content, use of oral and written language
Engaging students in learning (ESL)                     Activities and assignments, grouping of students, instructional materials and resources, structure and pacing
Using assessment in instruction (UAI)                   Assessment criteria, monitoring of student learning, feedback to students, student self-assessment and monitoring of progress
Using questioning and discussion techniques (USDT)      Quality of questions, discussion techniques, student participation
Table 7: FFT Teaching Practice Correlations and Factor Loadings

        CERR        MCP         MSB         USDT        ECL         CS          ESL         Factor 1    Factor 2
                                                                                            Loadings    Loadings
CERR    1                                                                                   0.196       0.680
MCP     0.602***    1                                                                       0.055       0.779
MSB     0.676***    [illegible] 1                                                           -0.090      0.934
USDT    [illegible] 0.413***    0.395***    1                                               0.790       -0.033
ECL     0.627***    0.497***    0.496***    0.569***    1                                   0.699       0.170
CS      0.568***    0.524***    0.464***    0.559***    0.601***    1                       0.592       0.219
ESL     0.489***    [illegible] 0.415***    0.627***    0.700***    0.575***    1           0.886       -0.067
UAI     0.462***    0.468***    [illegible] 0.644***    0.597***    0.586***    0.667***    0.826       -0.032
Obs.    732

Notes: The first seven columns show correlations between FFT components. We use the entire sample of fourth and fifth grade teachers from both years, i.e., 732 teacher-year observations. The last two columns present factor loadings from an exploratory factor analysis after performing an oblique rotation of the factors and keeping the first two factors. The first factor explains 79% of the variance in the data, and the second explains another 13%. CERR (creating an environment of respect and rapport), USDT (using questioning and discussion techniques), ECL (establishing a culture of learning), MCP (managing classroom procedures), CS (communicating with students), MSB (managing student behaviors), ESL (engaging students in learning), UAI (using assessment in instruction). See Table 6 for a detailed description of each FFT variable.
Table 8: Summary Statistics: Pre-Restricted Sample


Mean SD Min Max 
Grade Level 4.52 0.50 4.00 5.00 
Joint Math and ELA Class 0.85 0.36 0.00 1.00 
Age 9.46 0.96 vaty 13.20 
Male 0.49 0.50 0.00 1.00 
Gifted 0.08 0.27 0.00 1.00
Special Education 0.09 0.29 0.00 1.00 
English Language Learner 0.15 0.36 0.00 1.00 
White 0.28 0.45 0.00 1.00 
Black 0.34 0.48 0.00 1.00 
Hispanic 0.27 0.45 0.00 1.00 
Asian 0.07 0.26 0.00 1.00 
American Indian 0.00 0.07 0.00 1.00 
Race Other 0.02 0.15 0.00 1.00 
Race Missing 0.01 0.11 0.00 1.00 
Math Score (Year 09-10) 0.11 0.93 -3.14 2.84 
Math Score (Year 10-11) 0.14 0.93 -3.26 3.02 
Unique Districts 5.00 - - - 
Unique Classes 361.00 - - - 
Unique Schools 101.00 - - - 
Unique Randomization Blocks 156.00 - - - 
Unique Teachers 361.00 - - - 
sii a OPIN abi 0.07 0.63 1.00 
tin ee ea in Ran- 0.76 = 0.19 0.03SS«1.00 
ies per Randomization 3.03 1.49 1.00 12.00 
ingaca Block Compli- 0.66 0.40 0.00 1.00 
Observations 5730 


Notes: This sample corresponds to all students in the 2010-11 school year in either a 
fourth or fifth grade Math or Joint Math/ELA course. Since our estimation strategy 
leverages the random assignment of classrooms to teachers, we restrict the sample to 
students with a randomly assigned teacher. No further restrictions are made. Not all 
cells have the same number of observations. 
Table 9: Balance Tests


                 Classroom     Challenge/         Avg. Math     IQR Math      Avg. Math     IQR Math
                 Management    Student-Centered   of Observed   of Observed   of Assigned   of Assigned
                 Random        Random             Peers         Peers         Peers         Peers
                 Teacher       Teacher
                 (1)           (2)                (3)           (4)           (5)           (6)
Peer Math -0.022 -0.015 
(0.079) (0.119) 
IQR Math 0.036 0.041 
(0.107) (0.103) 
Peer Math Rand -0.089 -0.046 
(0.103) (0.124) 
IQR Math Rand -0.023 0.032 
(0.087) (0.084) 
Math,-1 -0.021 -0.005 0.050 -0.029 0.053 0.006 
(0.020) (0.024) (0.048) (0.036) (0.048) (0.025) 
ELL -0.048 -0.015 -0.200 0.025 -0.197 -0.023 
(0.059) (0.061) (0.129) (0.124) (0.136) (0.091) 
Gifted -0.033 -0.053 0.491** 0.160 0.274 0.230* 
(0.075) (0.144) (0.227) (0.107) (0.175) (0.123) 
Special Educ. O11" 0.089 -0.128* 0.043 -0.055 0.028 
(0.059) (0.057) (0.065) (0.084) (0.055) (0.066) 
Male 0.008 0.002 -0.023 -0.008 -0.035 -0.025 
(0.013) (0.015) (0.019) (0.015) (0.023) (0.018) 
White 0.011 -0.044 0.035 0.011 -0.039* -0.014 
(0.029) (0.032) (0.042) (0.036) (0.023) (0.032) 
Black 0.005 0.001 0.005 0.041 O.0oa"* 0.044 
(0.028) (0.031) (0.046) (0.048) (0.026) (0.048) 
Hispanic -0.059** -0.046 -0.036 -0.029 -0.022 -0.017 
(0.028) (0.034) (0.029) (0.043) (0.028) (0.046) 
Asian 0.087 OAge* 0.070* -0.037 0.053 0.000 
(0.054) (0.055) (0.039) (0.066) (0.044) (0.064) 
American Indian 0.062 0.176 -0.339 0.247 -0.215 0.076 
(0.145) (0.118) (0.205) (0.174) (0.212) (0.139)
Race Other 0.070 0.063 -0.083 -0.058 -0.074** -0.044 
(0.066) (0.088) (0.050) (0.048) (0.037) (0.053) 


Notes: We regressed each dependent variable separately on each independent variable with randomization block fixed effects and stacked the parameters from these regressions (i.e., each row-column combination corresponds to a separate regression). Columns (1) and (2) refer to a student's randomly assigned teacher's practice measured in t - 1 (i.e., $P_{jt-1}$ in the present notation). Columns (3) and (4) use the actual classroom composition whereas columns (5) and (6) focus on the peers who were initially assigned to be grouped with the student.
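The role of the randomization block fixed effects in a balance test can be sketched as follows. This is an illustrative simulation, not the paper's code: it assumes that teacher practice and a student characteristic both vary across blocks (so they are correlated in the pooled data) while teachers are assigned independently of students within a block. All names, sizes and parameters are hypothetical.

```python
import random
from collections import defaultdict

def demean_within(values, groups):
    """Subtract the group mean from each value (the within transformation)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [v - totals[g] / counts[g] for v, g in zip(values, groups)]

def slope(y, x):
    """OLS slope from a simple regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def fe_slope(y, x, groups):
    """Slope of y on x after absorbing group (block) fixed effects."""
    return slope(demean_within(y, groups), demean_within(x, groups))

rng = random.Random(2)
practice, trait, block = [], [], []
for b in range(150):                      # hypothetical randomization blocks
    block_effect = rng.gauss(0.0, 1.0)    # sorting of teachers and students across blocks
    for _ in range(3):                    # teachers per block
        teacher_practice = block_effect + rng.gauss(0.0, 1.0)
        for _ in range(10):               # students per classroom
            practice.append(teacher_practice)
            trait.append(block_effect + rng.gauss(0.0, 1.0))
            block.append(b)

pooled = slope(practice, trait)           # picks up across-block sorting
within = fe_slope(practice, trait, block) # should be close to zero under random assignment
print(round(pooled, 3), round(within, 3))
```

In this setup the pooled slope is far from zero even though assignment within blocks is random, which is why balance is assessed conditional on the block fixed effects.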


Table 10: Contemporaneous Teaching Practice and Classroom Composition 


ITT IV ITT IV 
Time t — 1 Time t oS Time t 
Practice Practice Practice Practice 


(1) (2) (3) (4) (5) (6) 


Panel A 

Classroom Management 0.008 0.010 0.049* 0.040 
(0.019) (0.018) (0.026) (0.058) 

C.M. x Math_{t-1} 0.011 0.012 0.004 0.026
(0.012) (0.012) (0.018) (0.021)


C.M. x Avg. Peer Math_{t-1} 0.07*** 0.085*** 0.218*** 0.082*** 0.213*** 0.209***
(0.025) (0.022) (0.065) (0.021) (0.062) (0.061)
C.M. x IQR Peer Math_{t-1} -0.017


(0.019) 
First Stage F-Stat.! 28.400 31.563 4.191 
Panel B 
Challenge/Student-Centered 0.018 0.022 0.072*** -0.006
(0.020) (0.019) (0.026) (0.127)
C.S.C x Math_{t-1} 0.012 0.018 0.008 0.041
(0.012) (0.012) (0.014) (0.033)
C.S.C x Avg Peer Math_{t-1} 0.031**
(0.014) 


C.S.C x IQR Peer Math_{t-1} -0.058*** -0.063*** -0.178*** -0.060*** -0.181*** -0.174***
(0.014) (0.016) (0.052) (0.014) (0.051) (0.051)


First Stage F-Statistic! 26.896 24.538 2.107 


Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the randomization block level. Sample size is 2632. Randomly assigned teachers are used throughout. Panels A and B correspond to different regressions with math as the dependent variable. These regressions include randomization block fixed effects and controls for the level and a squared term of prior math achievement, average peer prior achievement and IQR in peer prior achievement, along with the peer variables squared and interactions with each other and lagged math achievement. Controls for CKT and student characteristics listed in Table 1 are also included. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test.
Table 11: Comparison between the Hausman Estimator and ITT-IV specifications


MCP ITT   MCP IV   FFT MCP-MSB-CERR   ESL ITT   ESL IV   FFT ESL-USDT
(1) (2) (3) (4) (5) (6)
Teaching Practice 0.004 0.001 0.011 0.028 0.022 0.021 
(0.019) (0.021) (0.027) (0.019) (0.020) (0.031) 
T.P. x Math, 0.009 0.014 0.011 0.001 0.014 0.013 
(0.014) (0.015) (0.022) (0.012) (0.013) (0.021) 
T.P. x Peer Math 0.052***  —0.105*** 0.111*** 0.005 0.043** 0.019 
(0.019) (0.033) (0.039) (0.014) (0.017) — (0.044) 
T.P. x IQR Math —0.035**  —0.004 -—0.008 —0.047*** —0.055***-0.063** 
(0.016) (0.025) (0.033) (0.016) (0.014) (0.029) 
P-value joint signif. T.P. 0.000 0.000 0.038 0.000 
First Stage F-Stat.! 21.7 12.5 
Hansen J P-value’ 0.522 0.643 
p2 load 1.080 0.804 
p3 load 0.859 0.837 


Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Sample size is 2632. Managing student behaviors (MSB), Managing classroom procedures (MCP), Creating an environment of respect and rapport (CERR), Engaging students in learning (ESL), Using questioning and discussion techniques (USDT). The ITT columns use randomly assigned MCP or ESL scores as “Practice.” The IV columns use all other practices that load on classroom management to instrument for MCP, and likewise for ESL with challenge/student-centered practices. Practices are for the randomly assigned teacher measured at t - 1. We use the efficient GMM estimator. FFT MCP-MSB-CERR uses our adapted Hausman estimator to correct for measurement error, where MCP is the anchor and MSB is used to construct moment conditions. FFT ESL-USDT is similar but uses the average of all other challenge/student-centered practices as the third measurement since we are overidentified. The specification is identical to that in Table 4 except that here we do not include controls for student characteristics. † Reports the Kleibergen-Paap rk Wald statistic. ‡ Reports the p-value from Hansen’s J statistic test of overidentifying restrictions. “p2 load” and “p3 load” are the recovered measurement parameters described in Appendix C.2. Standard errors are clustered at the randomization block level, and with the adapted Hausman estimator we bootstrap standard errors with 200 repetitions.
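Bootstrapping with clustering at the randomization block level amounts to resampling whole blocks with replacement and re-estimating on each resample. The sketch below illustrates the mechanics with a deliberately simple placeholder estimator (a sample mean) rather than the adapted Hausman estimator; the data-generating process, the helper names and the choice of 200 repetitions as a default are assumptions for illustration only.

```python
import random
import statistics

def cluster_bootstrap_se(data_by_block, estimator, reps=200, seed=0):
    """Standard error of `estimator` from resampling whole blocks with replacement."""
    rng = random.Random(seed)
    block_ids = list(data_by_block)
    estimates = []
    for _ in range(reps):
        resampled = []
        # Draw as many blocks as in the original sample, keeping each
        # drawn block's observations together.
        for b in rng.choices(block_ids, k=len(block_ids)):
            resampled.extend(data_by_block[b])
        estimates.append(estimator(resampled))
    return statistics.stdev(estimates)

def sample_mean(values):
    """Placeholder estimator; any function of the resampled data could be used."""
    return sum(values) / len(values)

# Hypothetical data: outcomes share a common shock within each block.
rng = random.Random(3)
data_by_block = {}
for b in range(100):
    shock = rng.gauss(0.0, 1.0)
    data_by_block[b] = [shock + rng.gauss(0.0, 1.0) for _ in range(20)]

pooled = [y for ys in data_by_block.values() for y in ys]
naive_se = statistics.stdev(pooled) / len(pooled) ** 0.5  # ignores within-block correlation
clustered_se = cluster_bootstrap_se(data_by_block, sample_mean)
print(round(naive_se, 3), round(clustered_se, 3))
```

When outcomes are correlated within blocks, the block-level bootstrap yields a noticeably larger standard error than the formula that treats observations as independent.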
Table 12: Individual FFT Subdomain Regressions


Panel A                                        Creating environment   Managing classroom   Managing student   Establish culture
                                               of respect & rapport   procedures           behaviors          of learning
Practice                                       0.020                  0.004                0.003              0.010
                                               (0.018)                (0.019)              (0.017)            (0.022)
Practice x Math_{t-1}                          0.011                  0.009                0.009              0.006
                                               (0.012)                (0.013)              (0.013)            (0.011)
Practice x Avg Peer Math_{t-1}                 0.057***               0.052***             0.072***           0.049**
                                               (0.020)                (0.019)              (0.026)            (0.019)
Practice x IQR Peer Math_{t-1}                 -0.019                 -0.035**             -0.012             -0.040***
                                               (0.017)                (0.016)              (0.019)            (0.015)
P-value (joint signif. of teaching practice)   0.000                  0.000                0.002              0.000

Panel B                                        Engaging students      Using questioning    Using assessment   Communicating
                                               in learning            and discussion       in instruction     with students
Practice                                       0.028                  0.011                0.016              0.014
                                               (0.018)                (0.017)              (0.021)            (0.018)
Practice x Math_{t-1}                          0.001                  0.011                0.020*             0.015
                                               (0.012)                (0.014)              (0.011)            (0.012)
Practice x Avg Peer Math_{t-1}                 0.005                  0.020*               0.037**            0.034**
                                               (0.014)                (0.012)              (0.016)            (0.014)
Practice x IQR Peer Math_{t-1}                 -0.047***              -0.046***            -0.039***          -0.070***
                                               (0.015)                (0.017)              (0.014)            (0.026)
P-value (joint signif. of teaching practice)   0.035                  0.017                0.000              0.000

Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the randomization block level. Panels A and B correspond to different regressions with math as the dependent variable. Lagged teaching practices are used and sample size is 2632. These regressions include randomization block fixed effects and controls for the level and a squared term of prior math achievement, average peer prior achievement, IQR of peer prior achievement as well as CKT and student characteristics listed in Table 1. The first 3 subdomains correspond to classroom management, the remainder to challenge/student-centered.


Table 13: Teaching Practices and Alternative Teacher “Quality” Controls Full Results 


Random IV Actual Random Teacher 
Teacher with Random Alt. Teacher Control: 
‘Teacher CKT 7C PSVY 
(1) (2) (3) (4) (5) 
Classroom Management -0.012 -0.016 -0.014 -0.016 -0.015 
(0.020) (0.022) (0.020) (0.020) (0.019) 
C.M. x Mathy_1 0.004 0.004 0.011 0.004 0.003 
(0.020) (0.021) (0.019) (0.019) (0.019) 
C.M. x Peer Math 0.076** 0.087** O.0ce* 0.076** 0.076" ** 
(0.029) (0.036) (0.030) (0.029) (0.027) 
C.M. x IQR Math 0.026 0.035 0.026 0.026 0.026 
(0.022) (0.026) (0.022) (0.023) — (0.021) 
C.S.C 0.026 0.025 0.026 0.026 0.011
(0.023) (0.025) (0.022) (0.022) (0.024)
C.S.C x Math_{t-1} 0.010 0.011 0.002 0.016 0.005
(0.020) (0.021) (0.020) (0.019) (0.019)
C.S.C x Peer Math -0.010 -0.009 -0.010 -0.010 -0.005
(0.019) (0.022) (0.019) (0.019) (0.019)
C.S.C x IQR Math -0.062*** -0.071*** -0.063*** -0.057** -0.054**
(0.017) (0.019) (0.017) (0.021) (0.021)
CKT -0.007 -0.011 -0.008 -0.006 -0.013 
(0.016) (0.019) (0.016) (0.016) (0.018) 
Alt. Teacher Control -0.006 0.055*** 
(0.019) (0.017) 
T.C. x Math, 0.044*** -0.029**  0.032** 
(0.014) (0.013) (0.013) 
T.C. x Peer Math -0.019 -0.007 -0.016 
(0.018) (0.020) (0.016) 
T.C. x IQR Math -0.012 -0.017 -0.003 
(0.021) (0.021) ~—- (0.016) 
T.C. missing -0.591*** 
(0.142) 
T.C. missing x Math;_1 0.025 
(0.046) 
T.C. missing x Peer Math 0.060 
(0.045) 
T.C. missing x IQR Math 0.015
(0.055) 


Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Due to the length of this 
table, we’ve split it into three parts to show all parameters. See tables (13), (14) and (15). 


Table 14: Teaching Practices and Alternative Teacher “Quality” Controls Full Results (Continued) 


Random IV Actual Random Teacher 
Teacher with Random Alt. Teacher Control: 
Teacher CKT 7C PSVY 
(1) (2) (3) (4) (5) 
Math;_1 On72a tee © “Ont et ORS. A 2s (gets 
(0.017) (0.017) (0.016) (0.016) (0.018) 
Math?_, -0.043*** -0.043*** SO, G44 = <0:GA2 "0.045" 
(0.012) (0.012) (0.012) (0.012) ~— (0.012) 
Peer Math x Math;-1 -0.000 -0.001 -0.001 -0.003 0.002 
(0.018) (0.018) (0.018) (0.018) (0.018) 
IQR Math x Math;_1 0.034** 0.035"** 0.040""*"° “0.03144 0.044*** 
(0.013) (0.013) (0.013) (0.013) (0.012) 
Peer Math x IQR Math -0.052*** .0.053*** -0.057*** -0.053***  -0.043** 
(0.016) (0.016) (0.017) (0.016) (0.017) 
oe TO Dat ot -0.021 0.019  -0.023 —--0.016 
(0.014) (0.014) (0.015) (0.014) (0.014) 
Peer Math -0.008 -0.008 -0.009 -0.007 -0.012 
(0.026) (0.026) (0.025) (0.026) (0.027) 
Peer Math? -0.010 -0.013 -0.009 -0.009 -0.014 
(0.014) (0.013) (0.014) (0.014) ~—- (0.016) 
IQR Math -0.015 -0.018 -0.017 -0.019 -0.008 
(0.023) (0.024) (0.022) (0.024) (0.026) 
IQR Math? -0.008 -0.009 -0.009 -0.010 -0.002 
(0.013) (0.014) (0.013) (0.014) (0.014) 
ELL 0.008 0.015 0.011 0.007 0.009 
(0.038) (0.039) (0.039) (0.038) (0.038) 
Gifted 0.195***  — 0.188*** OA9a =: OOD 14 Oe 
(0.055) (0.054) (0.054) (0.057) (0.056) 
Male -0.001 -0.004 -0.002 -0.001 -0.001 
(0.021) (0.021) (0.021) (0.021) ~— (0.021) 
Special Educ. -0.111** -0.110** -0.110** -0.112** -0.108** 
(0.043) (0.043) (0.042) (0.044) (0.043) 
Black “O.15(*** > =O. 1597+ -0.148*** -0.154*** -0.156*** 
(0.033) (0.032) (0.033) (0.033) (0.034) 
Hispanic -0.047 -0.051 -0.044 -0.046 -0.049 
(0.035) (0.034) (0.035) (0.036) (0.035) 


Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Due to the length of this 
table, we’ve split it into three parts to show all parameters. See tables (13), (14) and (15). 


Table 15: Teaching Practices and Alternative Teacher “Quality” Controls Full Results (Continued) 


Random IV Actual Random Teacher 
Teacher with Random Alt. Teacher Control: 
Teacher CKT 7C PSVY 
(1) (2) (3) (4) (5) 

Asian 0.076** 0.069* 0.082** 0.078** 0.070*

(0.036) (0.036) (0.036) (0.036) (0.036) 
American Indian -0.045 -0.050 -0.036 -0.045 -0.048 

(0.108) (0.107) (0.107) (0.109) (0.111) 
Race Other 0.013 0.013 0.016 0.016 0.013 

(0.047) (0.046) (0.047) (0.047) (0.047) 
Race Missing -0.040 -0.046 -0.012 -0.041 -0.044 

(0.069) (0.066) (0.073) (0.067) (0.061) 
R-squared 0.649 0.708 0.651 0.650 0.652 
ea job Sigal OF OMe 6.000 0.000 0.000 0.000 0.002 
P-value joint signif of T.C. 0.052 0.172 0.013 
First Stage F-Statistic! 27.717 


Notes: *** denotes significance at the 1%, ** at the 5% and * at the 10% levels. Standard errors are clustered at the randomization block level. Sample size is 2632. The dependent variable is math and teaching practices are measured at t - 1. Regressions include randomization block fixed effects and controls for the level and a squared term of prior math achievement and the average and IQR of peer prior achievement, their squares and all pairwise interactions of peer variables and prior achievement, as well as student characteristics listed in Table 1. Even columns also include the IQR in peer prior achievement. † Reports the Kleibergen-Paap rk Wald statistic for a weak instrument test. CKT denotes the Content Knowledge for Teaching assessment, 7C denotes overall student survey teacher ratings based on Tripod and PSVY denotes principal assessments of teacher quality.